http://swathidharshananaidu.posterous.com/open-source-hive-large-scale-distributed-data
DISCLAIMER: This is a prototype version of Hive and is NOT production quality. However, we are working hard to make Hive a production-quality system. Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 so far, although it may well work on other similar platforms. It does not work on Cygwin right now. Most of our testing has been done against Hadoop 0.17.2, so we advise running Hive against that version of Hadoop, even though it may compile and work against other versions.

Hive Introduction Video
Hive Tutorial Video

Requirements:
- Java 1.6
- Hadoop 0.17.x to 0.19.x. Support for Hadoop 0.20.x is in progress: [HIVE-487]

Hive is available via SVN at: http://svn.apache.org/repos/asf/hadoop/hive/trunk

In the rest of this page, we use build/dist and <install-dir> interchangeably.

Hive uses Hadoop, which means you must either have hadoop in your path or export HADOOP_HOME=<hadoop-install-dir>. In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) in HDFS and set them chmod g+w before a table can be created in Hive; the commands to perform this setup are shown under Running Hive below. I also find it useful, though not necessary, to set HIVE_HOME. To use the Hive command line interface (CLI) from the shell, run $HIVE_HOME/bin/hive.

- Hive's default configuration is stored in <install-dir>/conf/hive-default.xml. Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
- The log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
- Hive configuration is an overlay on top of the Hadoop configuration, meaning the Hadoop configuration variables are inherited by default.
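To make the overlay concrete, a variable from hive-default.xml can be overridden by repeating its property block in <install-dir>/conf/hive-site.xml. A minimal sketch using the warehouse-directory variable mentioned on this page (the value shown is the default from the text, not a recommendation):

```xml
<!-- conf/hive-site.xml: entries here override those of the same name
     in hive-default.xml, which itself overlays the Hadoop configuration -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <!-- root of the Hive-controlled directory tree in HDFS -->
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```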
Hive configuration can be manipulated by:
- editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
- using the set command from the CLI (see below)
- invoking Hive with the syntax $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2, which sets the variables x1 and x2 to y1 and y2 respectively

Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI; they are stored in the file /tmp/{user.name}/hive.log. If the user wishes, the logs can be emitted to the console by adding the argument shown below:
- bin/hive -hiveconf hive.root.logger=INFO,console
Note that setting hive.root.logger via the 'set' command does not change logging properties, since they are determined at initialization time. Error logs are very useful for debugging problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

Creating Hive tables and browsing through them: the first CREATE TABLE statement below creates a table called pokes with two columns, the first an integer and the other a string. The second creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be in text input format and the delimiters are assumed to be ^A (ctrl-a). SHOW TABLES lists all the tables, and SHOW TABLES '.*s' lists all the tables that end with 's'; the pattern matching follows Java regular expressions (see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html for documentation). hive> DESCRIBE invites; shows the table's list of columns. As for altering tables, table names can be changed and additional columns can be added. Dropping tables is done with DROP TABLE. Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable javax.jdo.option.ConnectionURL.
By default (see conf/hive-default.xml), this location is ./metastore_db. Right now, in the default configuration, this metadata can only be seen by one user at a time. The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName; refer to the JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model. In the future, the metastore itself could be a standalone server. If you want to run the metastore as a network server so it can be accessed from multiple nodes, try HiveDerbyServerMode.

Loading data from flat files into Hive: the first LOAD statement below loads a file that contains two columns separated by ctrl-a into the pokes table. 'local' signifies that the input file is on the local file system; if 'local' is omitted, Hive looks for the file in HDFS. The keyword 'overwrite' signifies that existing data in the table is deleted; if 'overwrite' is omitted, data files are appended to the existing data set. NOTES: no verification of the data against the schema is performed by the load command. If the file is in HDFS, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml; we advise users to create this directory before trying to create tables via Hive. The two LOAD statements with a PARTITION clause load data into two different partitions of the invites table; the table must have been created as partitioned by the key ds for this to succeed. The LOAD with an HDFS path loads data from an HDFS file/directory into the table. Note that loading data from HDFS results in moving the file/directory, so the operation is almost instantaneous.
Hive queries are executed as map-reduce jobs, so the behavior of such queries can be controlled by the Hadoop configuration variables. The CLI command SET can be used to set any Hadoop (or Hive) configuration variable. SET -v shows all the current settings; without the -v option, only the variables that differ from the base Hadoop configuration are displayed. In particular, the number of reducers should be set to a reasonable number to get good performance (the default is 1!).

Some example queries are shown below. They are available in build/dist/examples/queries; more are available in the Hive sources at ql/src/test/queries/positive.

The first SELECT selects column foo from all rows of partition <DATE> of the invites table. The results are not stored anywhere but are displayed on the console. Note that in all the examples that follow, INSERT (into a Hive table, local directory, or HDFS directory) is optional. The INSERT OVERWRITE DIRECTORY example selects all rows from partition <DATE> of the invites table into an HDFS directory; the result data ends up in files (their number depending on the number of mappers) in that directory. NOTE: partition columns, if any, are selected by the use of *; they can also be specified in the projection clauses. Partitioned tables must always have a partition selected in the WHERE clause of the statement. The LOCAL DIRECTORY example selects all rows from the pokes table into a local directory. SUM computes the sum of a column; avg, min, and max can also be used. The TRANSFORM example streams the data in the map phase through the script /bin/cat (like Hadoop streaming); similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

last edited 2009-07-24 08:28:25 by ZhengShao
Preparations
Requirements
Downloading and building
$ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
$ cd hive
$ ant -Dhadoop.version="<your-hadoop-version>" package
# For example
$ ant -Dhadoop.version="0.17.2" package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
Running Hive
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$ export HIVE_HOME=<hive-install-dir>
$ $HIVE_HOME/bin/hive
Configuration management overview
Error Logs
DDL Operations
hive> CREATE TABLE pokes (foo INT, bar STRING);
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
hive> SHOW TABLES;
hive> SHOW TABLES '.*s';
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> DROP TABLE pokes;
Metadata Store
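Pointing the metastore at an external RDBMS instead of the embedded Derby database comes down to overriding the two JDO variables described above in hive-site.xml. A hedged sketch for MySQL follows — the host, port, and database name are placeholders, and the standard MySQL JDBC driver jar must be available on Hive's classpath:

```xml
<!-- hive-site.xml fragment: hypothetical external metastore settings -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- dbhost and hive_metastore are illustrative names, not defaults -->
  <value>jdbc:mysql://dbhost:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```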
DML Operations
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
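Because ^A-delimited files are awkward to type by hand, here is one way to generate a small sample input matching the pokes schema (the file name and rows are made up for illustration; \001 is the octal escape for ctrl-a):

```shell
# write two rows of (INT, STRING) separated by ctrl-a (\001)
printf '1\001hello\n2\001world\n' > /tmp/kv_sample.txt
# sanity check: od -c renders the ctrl-a bytes visibly as \001
od -c /tmp/kv_sample.txt
```

Such a file could then be loaded with, e.g., LOAD DATA LOCAL INPATH '/tmp/kv_sample.txt' INTO TABLE pokes (no OVERWRITE, so it appends).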
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
SQL Operations
Runtime configuration
hive> SET mapred.job.tracker=myhost.mycompany.com:50030
hive> SET -v
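As noted earlier, the default number of reducers is 1; it can be raised for the session with the same SET mechanism. A sketch (32 is an arbitrary illustrative value; mapred.reduce.tasks is the Hadoop property name of this era):

```
hive> SET mapred.reduce.tasks=32
```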
Example Queries
SELECTS and FILTERS
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
GROUP BY
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
MULTITABLE INSERT
FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';