Saturday, July 25, 2009

Open source Hive: Large-scale, distributed data processing made easy

Thank heaven for Hive, a data analysis and query front end for Hadoop that makes Hadoop data files look like SQL tables

Suppose you want to run regular statistical analyses on your Web site's traffic log data -- several hundred terabytes, updated weekly. (Don't laugh. This is not unheard of for popular Web sites.) You're already familiar with Hadoop (see InfoWorld's review), the open source distributed processing system that would be ideal for this task. But what if you don't have time to code Hadoop map/reduce functions? Perhaps you're not the elite programmer that everyone in the office thinks you are.

What you'd like to do is dump all that information into a database, and execute a set of SQL queries on it. But the quantity of data would overwhelm even an enterprise-level RDBMS.


This is precisely the problem that engineers at Facebook encountered. They became interested in Hadoop as a means of processing their Web site's traffic data, which amounted to terabytes per day, kept growing, and was overtaxing their Oracle database. Though they were happy with Hadoop, they wanted to simplify its use so that engineers could express frequently used analysis operations in SQL. The resulting Hadoop-based data warehouse application became Hive, and it helps process more than 10TB of Facebook data daily. Now Hive is available as an open source subproject of Apache Hadoop.

Inside the Hive
Written in Java, Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduce tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS.

Employing Hadoop's distributed file system (HDFS) as data storage, Hive inherits all of Hadoop's fault tolerance, scalability, and adeptness with huge data sets. When you run Hive, you are deposited into a shell, within which you can execute Hive Data Definition Language (DDL) and HQL commands. A future version of Hive will include JDBC and ODBC drivers, at which time you will be able to create fully executable "Hive applications" in much the same way that you can write a Java database application for your favorite RDBMS. (The current version of Hive -- 0.3.0 -- does have limited support for JDBC, but can only dispatch queries and fetch results.)

To install Hive, you simply install Hadoop and add a couple of download and configuration steps. (To install Hadoop, the best tutorial I've found is on Michael Noll's blog.) Or if you'd rather just get straight to testing Hive without all the installation nonsense, you can download a VMware virtual machine image with Hadoop and Hive pre-installed. The virtual machine image is featured in an excellent Hive tutorial video available at the same Web site.



Although Hive query language (HQL) commands are usually executed from within the Hive shell, you can launch the Hive Web Interface service and run HQL queries from within a browser. You can start multiple queries, and the Web interface will let you monitor the status of each.

I already had Hadoop running on an Ubuntu 8.10 system. To add Hive, I downloaded the Hive gzip archive and unpacked it into a folder next to the Hadoop home folder. Next, I defined a HIVE_HOME environment variable, and executed a few HDFS commands to create specific HDFS subdirectories that Hive requires. I launched the Hive shell and was ready to go. Total time was maybe 20 minutes. (This process is described in Hive's wiki, just off the Hive main Web page.)
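Those steps look roughly like the following; the unpack path and version number are assumptions, while the two HDFS directories are the ones Hive's wiki prescribes:

```shell
# Unpack Hive next to the Hadoop installation (path and version are assumptions)
tar -xzf hive-0.3.0-bin.tar.gz -C /usr/local
export HIVE_HOME=/usr/local/hive-0.3.0-bin
export PATH=$HIVE_HOME/bin:$PATH

# Create the HDFS directories Hive expects, and make them group-writable
hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

# Launch the Hive shell
hive
```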

Although Hive's principal goal is to provide an SQL-like query mechanism for Hadoop-based data, mimicry of SQL in such an environment can -- for a variety of reasons -- go only so far. First, HDFS was built for batchlike applications that pour large quantities of data into massive files that are subsequently processed by Hadoop map/reduce tasks. It is a write-once, read-often-and-sequentially file system. HDFS does not currently support random write operations and likely never will. Hence, HQL's closest approach to an SQL INSERT INTO command is INSERT OVERWRITE, which overwrites a table's existing content with new data. For example, suppose you have already created a Hive database table called TA, and you want to add new data to it from table TB. The HQL for this is:
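A sketch of that statement, assuming TA and TB share the same schema (early Hive versions require the UNION ALL to sit inside a subquery):

```sql
-- Rewrite TA as the union of its existing rows and TB's rows
INSERT OVERWRITE TABLE TA
SELECT u.* FROM (
  SELECT * FROM TA
  UNION ALL
  SELECT * FROM TB
) u;
```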


The new data is added by overwriting the old table with the concatenation of its original content and the data in TB.

In addition, Hive does not store database tables in a specialized file format. Instead, it causes ordinary HDFS files to "appear" to be database files. This illusion becomes apparent when you import data into a Hive table from a file stored in a standard Linux file system. No special conversion takes place; the file is copied byte for byte into Hive from its source image in the Linux directory. This means that you have to describe the structure of the file at the time you CREATE it as a Hive table.

For example, suppose I had converted the entire Encyclopedia Britannica into a single, linear text file and processed that to produce a data file consisting of word/offset pairs. For each line in the file, the first field is the text of a given word in the encyclopedia, and the second field is the large integer offset of the word's position in the text file. (So, the line "bob 1293" indicates that "bob" was the 1,293rd word in the encyclopedia.) Assuming the file's fields are separated by tab characters and the lines by line feeds, I could create a table for this file:
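A sketch of the DDL, with table and column names of my own choosing (britannica, word, pos):

```sql
-- Each line of the source file is "word<TAB>offset<LF>"
CREATE TABLE britannica (word STRING, pos BIGINT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Importing then copies the file byte for byte into Hive's warehouse
-- (the source path is hypothetical)
LOAD DATA LOCAL INPATH '/tmp/britannica.txt' INTO TABLE britannica;
```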


The structure of the file is explicitly described in the CREATE command. And when I imported the data into Hive, it would simply be copied directly, with no structural changes.

Nevertheless, Hive is impressive, particularly when you consider what is going on behind the scenes. It converts HQL expressions into map/reduce tasks that are compiled and executed on the cluster. In addition, the conversion is not a brute-force operation; Hive applies some intelligence. For example, Hive knows when conversion is unnecessary, so the simple expression "SELECT * FROM TA" will execute directly: an unfiltered table scan is just a sequential read of the underlying HDFS files, so no map/reduce job is needed. Hive also performs "pipelining" of complex queries where possible. That is, if a query is resolved into a linear sequence of map/reduce tasks, the intermediate output of the first map/reduce job is passed on to the next job in the series, even before the first job is completed -- and so on down the line. This significantly improves throughput, as different stages in the pipeline are able to execute in parallel.

More HQL tricks
HQL is designed to be easily mastered by anyone already familiar with SQL. Though HQL is definitely a subset of SQL, it provides a surprising amount of SQL-like functionality. Hive's DDL includes commands for creating and dropping tables as well as altering table structure (adding or replacing columns). Tables can also be created with partition specifiers, which -- if strategically arranged -- can accelerate some queries. HQL's SELECT clause supports subqueries, as well as GROUP BY and SORT BY clauses. Also, you can perform multiple JOIN operations in an HQL query (though only the equality operator can be used in the JOIN conditional).
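Against the word/offset table described earlier, a GROUP BY aggregation and an equality JOIN might look like this (the second table, stopwords, is hypothetical; note that early Hive expects COUNT(1) rather than COUNT(*)):

```sql
-- Count occurrences of each word
SELECT word, COUNT(1) AS freq
FROM britannica
GROUP BY word;

-- Equality join against a second (hypothetical) stopwords table
SELECT b.word, b.pos
FROM britannica b JOIN stopwords s ON (b.word = s.word);
```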

Other HQL language features have no direct SQL counterpart, but are understandable deviations when you consider HQL's raison d'etre. For example, if you already have a large table imported into Hive and want to test a query you've just written, but would rather not wait the hour you suspect the query will take, you can use Hive's TABLESAMPLE clause. Applied in conjunction with the CREATE command's CLUSTERED BY clause, adding the TABLESAMPLE clause to a query's FROM clause will involve only a subset of the entire table's data in that query, thereby reducing query execution time significantly.
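A sketch of that pairing, assuming a table bucketed on a user id column (all names are mine):

```sql
-- Bucket the table at creation time...
CREATE TABLE page_views (userid INT, url STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- ...then sample just 1 of the 32 buckets at query time
SELECT url, COUNT(1) AS hits
FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid)
GROUP BY url;
```

Because the sampled query touches only one bucket's worth of data, it runs in roughly 1/32 of the time of the full scan, which is usually close enough for testing a query's logic.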


Finally, if you want to add a new, user-defined function to HQL, Hive provides a plug-in mechanism whereby you can write your function (it will have to be in Java), compile it into a JAR file, and register it with the Hive infrastructure. Restart Hive, and your function is ready to use in your Hive queries.
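The shape of such a function, as a sketch (the class name and JAR path are mine; the evaluate-method convention is Hive's standard UDF contract):

```java
package com.example.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

// A trivial UDF: Hive calls evaluate() once per input value
public class ReverseString extends UDF {
    public String evaluate(String s) {
        if (s == null) return null;
        return new StringBuilder(s).reverse().toString();
    }
}
```

Once the class is compiled into a JAR, registration from the Hive shell looks roughly like this:

```sql
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION reverse_str AS 'com.example.hive.ReverseString';
SELECT reverse_str(word) FROM britannica;
```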

Join the Hive
Hive is easy to install, and HQL is easy to pick up if you already know even a modest amount of SQL. And Hive has a bright future; the road map of upcoming features includes more support for languages other than Java, a HAVING clause, improvements to Hive's JOIN capabilities, additional data types, indexes, and much more.

Hive, however, is not a replacement for an RDBMS. As already mentioned, Hive does not support random row insertion or deletion. The Hive Web site makes it clear that Hive is a tool for the analysis and summarization of large datasets; it is not meant for structured, randomly accessed content storage.

Hadoop is emerging as the current darling of the cloud computing crowd, and Hive certainly assists that ascent. Creating Hadoop map/reduce tasks demands programming skills that Hive does not require (though some map/reduce jobs will always necessitate hand-coding). Still, Hive is an ideal express-entry into the large-scale distributed data processing world of Hadoop. All the ease of SQL with all the power of Hadoop -- sounds good to me.

Thanks to Facebook engineers Joydeep Sen Sarma and Ashish Thusoo for their assistance with this article.

