Big data is a concept that's been as widely hyped as cloud computing, and perhaps just as misunderstood in regards to its capabilities and limitations. One of the aspects of big data that is not clearly understood is how existing databases can be used with data storage engines that are non-relational in nature. What's involved in moving data from a relational database management system (RDBMS) to distributed systems? And, perhaps more of interest to IT staff, what's the best way to learn about these big data systems, to determine the best way to use them in an organization?
Currently, the most popular example of a non-relational database management system (NDBMS) is probably Apache Hadoop, a distributed data framework that seems to be the poster child for big data and so-called NoSQL databases. But even those descriptions screen the true nature of Hadoop and how it works. What is Hadoop, really, and how can businesses and IT staffers start using it? Which businesses should use Hadoop and where can you find resources for implementing it?
These are some of the questions that will be answered in this three-part series, "The Road to Apache Hadoop." In part one, we'll examine how Hadoop is put together. Understanding how Hadoop generally works will give a more accurate picture of the skills DBAs and data analysts need to work with the Hadoop framework. And knowing the players in the Hadoop ecosystem will reveal some great sources for Hadoop training.
What Hadoop isn't
There are two elements that need to be cleared up about Hadoop right away: it's not a system that needs to be used exclusively with "big data," and it isn't really a NoSQL tool.
While it is true that Hadoop belongs to the non-relational class of data management systems, that doesn't preclude the use of the SQL query control language. NoSQL isn't even that; it's just a way to describe databases where SQL is not necessarily the only query system that can be used. In fact, SQL-like queries can be used fairly easily with Hadoop.
Then there's this notion of big data. Many people associate Hadoop with managing truly massive amounts of data. And with good reason: Hadoop storage is used by Facebook and Yahoo, which many people (rightly) associate with huge data sets. But Hadoop's use goes far beyond that of big data. One of Hadoop's strongest capabilities is its ability to scale, which puts it up in the big leagues with Yahoo and Facebook but also enables it to scale down to any company that needs inexpensive storage and data management.
To understand this broad capability of scale and the implications of scaling, it's important to understand how Hadoop works.
What Hadoop is
Arun Murthy is a man who knows Hadoop. As VP, Apache Hadoop at the Apache Software Foundation, he's the current leader of the project. Moreover, Murthy has been involved with Hadoop from its early days, when Yahoo adapted Google's open source data framework to their own purposes after it was invented by Doug Cutting to take advantage of Google's MapReduce data programming framework.
To say Yahoo is a big player in the Hadoop space is an understatement on many levels. Cutting, Murthy, and many early Hadoop contributors worked at Yahoo in the last decade. (Cutting now works with Cloudera, a commercial Hadoop vendor launched in 2009, and Murthy went on to co-found Hortonworks, Inc., in June of 2011 with several others on Yahoo's Hadoop team, including Eric Baldeschwieler, now Hortonworks' CEO.) Yahoo is also the largest user of Hadoop to date, having deployed a staggering 50,000-node Hadoop network.
With these credentials in mind, Murthy seemed the best person to ask about what Hadoop is and how it's put together.
As he explained it to me, the framework known as Hadoop can be composed of several components, but the big two are the aforementioned MapReduce data processing framework and a distributed filesystem for data storage, usually something like the Hadoop Distributed Filesystem (HDFS).
HDFS is, in many ways, the simplest Hadoop component to conceptualize (though not always the easiest to manage). It is pretty much what it's called: a distributed filesystem that shoves data onto any machine connected to the Hadoop network. Of course, there's a system to this, and it's not just all willy-nilly, but compared to the highly regimented storage infrastructure of a RDBMS, it's practically a pigsty.
Indeed, it's this flexibility that brings a lot of the value to the Hadoop proposition. While an RDBMS often needs finely tuned and dedicated machines, a Hadoop system can take advantage of commoditized servers with few good hard drives on board. Instead of dealing with the huge management overhead involved with storing data in relational database tables, Hadoop uses HDFS to store data across multiple machines and drives, automatically making data redundant on multiple-node Hadoop systems. If one node fails or slows down, the data is still accessible.
This introduces significant cost savings at the hardware and management levels. It should be noted that while HDFS is the usual filesystem used with Hadoop, it is by no means the only one. For its Elastic Compute Cloud (EC2) solutions, Amazon has adapted its S3 filesystem for Hadoop. DataStax' Brisk is a self-described Hadoop distribution that replaces HDFS with Apache Cassandra's CassandraFS and throws in the data query and analysis capabilities of Hive in for good measure to unify realtime storage and analytics capabilities. And such customization and adaptation is made all the easier thanks to Hadoop's open source nature.
MapReduce is a bit harder to conceptualize. Murthy describes it as a data processing, programming paradigm ... but what does that mean, exactly? As an illustration, it helps to think of MapReduce as analogous to the database engine, much as Jet is the engine for Microsoft Access (as many people do not recall).
When a request for information comes in, MapReduce uses two components: a JobTracker that sits on the Hadoop master node, and TaskTrackers that sit out on each node within the Hadoop network. The process is fairly linear. MapReduce will break data requests down into a discrete set of tasks, then use the JobTracker to send the MapReduce jobs out to the TaskTrackers. To cut down on network latency, jobs are assigned to the same node where the data lives, or at the very least to a node on the same rack.
There's more to Hadoop than just the distributed filesystem and MapReduce, as Figure 1 shows. A Hortonworks' representation of the Hadoop framework, this image shows other components that can be used with Hadoop, including:
HCatalog: A table and storage management service for Hadoop data.
Pig: A programming and data flow interface for MapReduce.
Hive: A data warehousing solution that makes use of a SQL-like language, HiveQL, to create queries for Hadoop data.
It is Hive, Murthy said, that makes Hadoop much easier to use than one might expect from a so-called NoSQL database. Using HiveQL, data analysts can pull out information from a Hadoop database with the same kind of queries they're used to using in a RDBMS. Moving to Hadoop will make for a transition, of course, as there are some differences between SQL and HiveQL, but these differences are not that great.
What do you need to know?
Data analysts won't have too much trouble adapting to Hadoop, but DBAs may face a steeper learning curve. That's because the distributed filesystem is a big departure from the traditional realm of database table storage in RDBMS.
The complexity of Hadoop is definitely a big hurdle to jump for prospective administrators, because the framework composition of all of the different Hadoop components means you have to manage a lot of different elements at once. Don't look for a shiny GUI to handle this, either. Hadoop, Hive, Sqoop, and other tools in the Hadoop ecosystem are controlled from the command line. Since Hadoop is Java-based, and MapReduce makes use of Java classes, a lot of the interaction is the kind where experience as a developer (and as a Java developer in particular) will be very handy.
Most Hadoop-related jobs typically call for experience with large-scale, distributed systems, and a clear understanding of system design and development through scaling, performance, and scheduling. In addition to experience in Java, programmers should be hands on and have a good background in data structures and parallel programming techniques. Cloud experience of any kind is a big plus.
This is a lot to have under your belt; so, for systems engineers and administrators who want to make the jump to Hadoop, Hortonworks will be offering a three-day Administering Apache Hadoop class. Cloudera has an active administration course now, as part of its Cloudera University curriculum. Courses on Hive, Pig, and developer training are also available. You can find additional coursework on the Hadoop Support wiki on the Apache site.
Moving down the road
Part 2 of "The Road to Hadoop" will look at the business implications moving to Hadoop. You'll see what businesses should be using Hadoop and how deployments usually happen. In Part 3, we'll examine the techniques and costs involved in moving to Hadoop from an existing RDBMS, as well as the tools used to analyze Hadoop data faster and more cheaply than any RDBMS.