If your organization seems to be a good fit for Hadoop, you can download the open source software that comprises the data framework and try it out with relative ease.
So far in this series, you've learned some of what it takes to be ready to administer Hadoop, and seen the benefits and drawbacks to using it. In this final installment, we'll examine the techniques and costs involved in moving to Hadoop from an existing RDBMS, see how companies are deploying Hadoop, and learn about tools you can use to analyze Hadoop data faster and more cheaply than any RDBMS.
Like many up-and-coming technologies, particularly those in the open source world, Hadoop has enjoyed the benefits of the do-it-yourself spirit of IT shops that want to take it out for a spin. Now that Hadoop is getting a lot of attention in technology media and at conferences, C-level executives are also getting into the Hadoop act, wanting to see just how much money Hadoop can save their companies. These two separate vectors of adoption -- from the trenches and from the bosses -- are common enough to warrant a closer look.
Bottom up: The shadow knows
Shadow IT is either a blessing or a curse to an organization. Many's the time when an experimental or sandbox configuration has ended up paying off in big ways for the organization's bottom line. Linux, for instance, was one such beneficiary of shadow IT at the turn of the century.
Today, it's sometimes Hadoop's turn in the shadows, according to Arun Murthy, VP, Apache Hadoop at the Apache Software Foundation.
"In the bottom-up method of deployment, usually there's a couple of engineers who download and deploy Hadoop either on a single node or maybe a small cluster with four or five nodes," Murthy explained.
What tends to happen next is a pattern that Murthy has seen many times. Staffers using the Hadoop cluster start to notice the value of the toolset. Perhaps other divisions of the company set up their own Hadoop clusters. Eventually, the value of Hadoop rises significantly and, thanks to the scalability of the underlying distributed filesystem, the separate Hadoop clusters are combined into a single large cluster with perhaps 50 or so nodes.
According to Murthy, this is exactly what happened when Yahoo and Facebook first adopted Hadoop. Once the value of Hadoop for all of the separate teams and applications became clear, it became obvious that combining everything into one large Hadoop network would be ideal.
Of course, not many companies will need to scale up to the ten- or fifty-thousand-node systems deployed by Facebook and Yahoo, respectively, but the general principle is still the same.
Top down: CXO mandates
Another common way Hadoop is deployed is from the top down. A C-level executive watching trends will note the very low costs of storage on a Hadoop system and will begin to formally explore whether the Hadoop solution is the right thing for the company.
This is where vendors like Murthy's current employer, Hortonworks, Inc., come in. Hortonworks, launched at the end of June 2011, was founded by Murthy and several other members of Yahoo's Hadoop team, and provides open source Hadoop products as well as training, support, and deployment services.
Usually, Murthy explained, Hortonworks will work with a potential new client and make a small set of recommendations based on what the client needs. They will also deploy a small proof-of-concept Hadoop cluster, anywhere from 20 to 100 nodes, and let the customer see the value of Hadoop for themselves. This formal process is similar to what other Hadoop vendors, such as Cloudera and MapR, provide, so you'll have a number of strong options to choose from when seeking Hadoop consulting and support.
Get the Sqoop
Whether you do it yourself, or employ help to do it, at some point you are going to need to migrate your data from its current storage location to Hadoop.
The best tool for doing this, especially from an RDBMS, is Cloudera's Sqoop ("SQL-to-Hadoop"). Sqoop is a command-line application that can import individual tables or whole databases into the Hadoop Distributed Filesystem (HDFS). Sqoop uses the DBInputFormat Java connector, which enables MapReduce to pull in relational database data via the JDBC drivers available for MySQL, PostgreSQL, Oracle, and most other popular databases.
Sqoop will also generate the Java classes needed for MapReduce to interact with the data, by deserializing record rows into discrete fields of information. You can also use Sqoop to import RDBMS data right into your Hive data warehouse.
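To make the import step more concrete, here is a minimal Python sketch that assembles a "sqoop import" command line for a single table. The JDBC URL, table name, user name, and HDFS path are hypothetical placeholders, and the helper function is invented for illustration; only the Sqoop options themselves (--connect, --table, --username, -P, --num-mappers, --target-dir, --hive-import) are real. Treat it as a starting point under those assumptions, not a recommended production script.

```python
import shlex


def build_sqoop_import(jdbc_url, table, username, target_dir=None,
                       hive_import=False, num_mappers=1):
    """Return the argument list for a single-table `sqoop import` run.

    This helper is hypothetical; it only assembles the command and does
    not execute it.
    """
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC URL of the source database
        "--table", table,             # table to copy into Hadoop
        "--username", username,
        "-P",                         # prompt for the password at run time
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]
    if hive_import:
        cmd.append("--hive-import")   # load the data into Hive
    elif target_dir:
        cmd += ["--target-dir", target_dir]  # HDFS directory for the output
    return cmd


# Hypothetical connection details, for illustration only.
hdfs_cmd = build_sqoop_import(
    "jdbc:mysql://db.example.com/sales", "orders", "etl_user",
    target_dir="/data/sales/orders", num_mappers=4)
hive_cmd = build_sqoop_import(
    "jdbc:mysql://db.example.com/sales", "orders", "etl_user",
    hive_import=True)

print(shlex.join(hdfs_cmd))
print(shlex.join(hive_cmd))
```

On a machine with Sqoop installed, either argument list could be handed to subprocess.run() to start the import. Building the command as a list rather than a single string avoids shell quoting problems with the JDBC URL.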
Because of this functionality, there is very little you should have to do to prepare your data for a migration to Hadoop, other than common-sense practices like deduping your data and keeping your RDBMS maintained.
Explore the Hive
As described in the first article in this series, Hive is the part of the Hadoop framework that enables analysts to structure and query data in HDFS. Data can be summarized, queried, and analyzed using the Hive Query Language (HiveQL), which is similar enough to SQL that analysts who already know SQL should find such operations familiar.
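As a rough illustration of how close HiveQL is to SQL, the Python sketch below wraps a query in a call to the Hive command-line client using its -e option, which runs a query string. The orders table and its columns are hypothetical, and the helper function is invented for this example.

```python
def build_hive_command(query):
    """Return the argument list that runs one HiveQL query via `hive -e`."""
    # Collapse newlines and indentation so the query travels as one argument.
    return ["hive", "-e", " ".join(query.split())]


# Hypothetical table and columns; the syntax mirrors ordinary SQL.
TOP_CUSTOMERS = """
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    ORDER BY order_count DESC
    LIMIT 10
"""

hive_cmd = build_hive_command(TOP_CUSTOMERS)
print(hive_cmd)
```

On a machine with Hive installed, the list could be passed to subprocess.run(). Behind the scenes Hive would compile the query into one or more MapReduce jobs, so expect it to take a while.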
Hive will also enable MapReduce programmers to directly plug in their custom data mappers and data reducers, should the HiveQL language prove to be unable to provide the information needed.
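One way to plug in a custom mapper is Hive's TRANSFORM clause, which streams rows to an external script as tab-separated lines on standard input and reads tab-separated lines back. The sketch below is a hypothetical mapper that turns (user_id, email) rows into (domain, 1) pairs; the column layout, the function names, and the file name mentioned afterward are assumptions made for illustration.

```python
import sys


def map_rows(lines):
    """Yield 'domain<TAB>1' for each tab-separated (user_id, email) row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or "@" not in fields[1]:
            continue  # skip malformed rows rather than failing the job
        domain = fields[1].rsplit("@", 1)[1].lower()
        yield f"{domain}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write the mapped rows to stdout."""
    for out in map_rows(stdin):
        stdout.write(out + "\n")


# When deployed as a Hive streaming script, finish the file with: main()
```

Saved as email_domain_mapper.py (a made-up name) and registered with ADD FILE, the script could then be invoked from HiveQL along the lines of SELECT TRANSFORM(user_id, email) USING 'python email_domain_mapper.py' AS (domain, cnt) FROM users, where users is again a hypothetical table.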
Care should be used when considering Hive: because Hadoop is a batch processing system, its jobs have high latency, which translates into very high latencies for Hive queries (as in minutes, not seconds). As such, Hive is not really a good system to use for real-time processing. If this is your need, consider working with Apache Cassandra, an open source distributed database management system that is much better for handling real-time needs.
Arriving at Hadoop
The migration path to Hadoop will vary, depending on your organization's needs, but Hadoop is a system that may surprise you in the value it can provide.
Hadoop is not strictly the purview of big data. It's for any organization that needs cheaper storage and the capability to analyze a lot of data efficiently. Is that organization yours?
Brian Proffitt writes ITworld's "Open for Discussion" blog. He tweets as @TheTechScribe.