Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2008/09/05 21:38:16 UTC
[Hadoop Wiki] Update of "Hive/GettingStarted" by RaghothamMurthy
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/GettingStarted
------------------------------------------------------------------------------
our testing has been done on Hadoop 0.17 - so we advise running Hive against
this version of hadoop - even though it may compile and work against other versions
-
- '''Downloading and building'''
+ == Downloading and building ==
- ------------------------
You can either build the hive jar files for your environment:
- - apply patch
+ * apply patch
- - cd hadoop/src/contrib/hive
+ * cd hadoop/src/contrib/hive
- - src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package
+ * src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package
Or you can use the pre-packaged jars that are available at the following location
and untar them:
http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
- $ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
+ * $ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
- $ tar zxvf dist.tar.gz
+ * $ tar zxvf dist.tar.gz
- $ cd dist
+ * $ cd dist
- $ ls
+ * $ ls
- README bin conf lib
+ o README bin conf lib
- bin/ (all the shell scripts)
+ * bin/ (all the shell scripts)
- lib/ (required jar files)
+ * lib/ (required jar files)
- conf/ (configuration files)
+ * conf/ (configuration files)
In the rest of the README, we use dist and <install-dir> interchangeably.
+ == Running Hive ==
- '''Running Hive'''
- ------------
Hive uses hadoop, which means:
- - you must have hadoop in your path OR
+ * you must have hadoop in your path OR
- - export HADOOP=<hadoop-install-dir>/bin/hadoop
+ * export HADOOP=<hadoop-install-dir>/bin/hadoop
To use the hive command line interface (cli) from the shell:
- $ bin/hive
+ ''$ bin/hive''
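The environment setup above can be sketched as follows (the install path used here is a hypothetical example, not taken from this page):

```shell
# Point Hive at an explicit hadoop binary instead of relying on PATH.
# /opt/hadoop-0.17 is a hypothetical install location; substitute your own.
export HADOOP=/opt/hadoop-0.17/bin/hadoop
echo "$HADOOP"
```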
+ == Using Hive ==
- '''Using Hive'''
- ----------
-
- '''Configuration management overview'''
+ === Configuration management overview ===
- ---------------------------------
- hive configuration is stored in <install-dir>/conf/hive-default.xml
and log4j in hive-log4j.properties
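Hive configuration properties follow the standard Hadoop XML form. As a sketch, an entry in hive-default.xml looks like this (the property name hive.metastore.warehouse.dir is referenced later on this page; the value shown is an assumed path):

```xml
<!-- Fragment of hive-default.xml; the path value is a hypothetical example. -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
```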
@@ -63, +57 @@
this sets the variables x1 and x2 to y1 and y2
+ === Error Logs ===
- '''Error Logs'''
- ----------
Hive uses log4j for logging. By default logs are not emitted to the
console by the cli. They are stored in the file:
- /tmp/{user.name}/hive.log
@@ -80, +73 @@
any bugs (of which there are many!) to athusoo@facebook.com.
+ == DDL Operations ==
- '''DDL Operations'''
- --------------
Creating Hive tables and browsing through them
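The examples on the rest of this page refer to tables named pokes and invites. A minimal sketch of DDL consistent with those references (the exact column definitions in the elided hunk may differ):

```sql
-- Hypothetical DDL consistent with the ALTER/LOAD examples below.
CREATE TABLE pokes (foo INT, bar STRING);
-- invites is loaded with PARTITION (ds=...) below, so it must be
-- partitioned by the key ds.
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
```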
@@ -120, +112 @@
Altering tables. Table name can be changed and additional columns can be dropped
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
+
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
+
hive> ALTER TABLE events RENAME TO 3koobecaf;
Dropping tables
hive> DROP TABLE pokes;
+ == Metadata Store ==
- '''Metadata Store'''
- --------------
Metadata is in an embedded Derby database whose location is determined by the
hive configuration variable named javax.jdo.option.ConnectionURL. By default
@@ -147, +140 @@
In the future - the metastore itself can be a standalone server.
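As a sketch of what that configuration variable looks like (the JDBC URL shown is the stock embedded-Derby form and is an assumption, not copied from this page):

```xml
<!-- Fragment of hive-default.xml; assumed default embedded-Derby metastore URL. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
```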
+ == DML Operations ==
- '''DML Operations'''
- --------------
Loading data from flat files into Hive
@@ -162, +154 @@
If the 'overwrite' keyword is omitted - then data files are appended to existing data sets.
NOTES:
- - NO verification of data against the schema
+ * NO verification of data against the schema
- - if the file is in hdfs it is moved into hive controlled file system namespace.
+ * if the file is in hdfs, it is moved into the hive-controlled file system namespace.
The root of the hive directory is specified by the option hive.metastore.warehouse.dir
in hive-default.xml. We advise creating this directory before
trying to create tables via Hive.
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
+
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two LOAD statements above load data into two different partitions of the table
invites. Table invites must be created as partitioned by the key ds for this to succeed.
- '''Loading/Extracting data using Queries'''
+ == Loading/Extracting data using Queries ==
- -------------------------------------
-
- '''Runtime configuration'''
+ === Runtime configuration ===
- ---------------------
- - Hives queries are executed using map-reduce queries and as such the behavior
+ * Hive queries are executed as map-reduce jobs and as such the behavior
of such queries can be controlled by the hadoop configuration variables
- - The cli can be used to set any hadoop (or hive) configuration variable. For example:
+ * The cli can be used to set any hadoop (or hive) configuration variable. For example:
- o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
+ o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
+
- o hive> SET - v
+ o hive> SET -v
The latter shows all the current settings. Without the -v option only the
variables that differ from the base hadoop configuration are displayed
- - In particular the number of reducers should be set to a reasonable number
+ * In particular the number of reducers should be set to a reasonable number
to get good performance (the default is 1!)
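For example, to raise the reducer count for the session (the value 8 is a hypothetical choice, not from this page):

```sql
-- mapred.reduce.tasks is the hadoop reducer-count variable;
-- 8 is an arbitrary example value, tune it to your cluster.
SET mapred.reduce.tasks=8;
```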
+ === Example Queries ===
- '''Example Queries'''
- ---------------
Some example queries are shown below. They are available in examples/queries.
More are available in the hive contrib sources src/test/queries/positive
- SELECTS and FILTERS
+ ==== SELECTS and FILTERS ====
- -------------------
-
hive> SELECT a.foo FROM invites a;
selects column 'foo' from all rows of the invites table. The results are not
@@ -221, +209 @@
Select all rows from pokes table into a local directory
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
+
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
+
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
+
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
+
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
+
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
+
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
Sum of a column. avg, min, max can also be used
@@ -233, +227 @@
NOTE: there are some flaws with the type system that cause doubles to be
returned where integer types would be expected. We expect to fix these in the coming week.
+ ==== GROUP BY ====
- GROUP BY
- --------
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
+
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
NOTE: Currently Hive always uses a two-stage map-reduce for the group-by operation. This is
to handle skews in the input data. We will be optimizing this in the coming weeks.
+ ==== JOIN ====
- JOIN
- ----
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo
- MULTITABLE INSERT
+ ==== MULTITABLE INSERT ====
- -----------------
+
- FROM src
+ FROM src
- INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
+ INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
- INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
+ INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
- INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
+ INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
- INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300
+ INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300
- STREAMING
- ---------
+ ==== STREAMING ====
+
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
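The /bin/cat transform above just echoes its input; a custom transform script can be any executable that reads tab-separated rows on stdin and writes tab-separated rows on stdout. A minimal sketch in Python (the field-swapping logic is a made-up example, not from this page):

```python
import sys

def transform(line):
    # Hypothetical transform: swap the two tab-separated fields,
    # mirroring SELECT TRANSFORM(a.foo, a.bar) ... USING '<script>'.
    foo, bar = line.rstrip("\n").split("\t")
    return bar + "\t" + foo

if __name__ == "__main__":
    # Hive feeds one tab-separated row per line on stdin.
    for line in sys.stdin:
        sys.stdout.write(transform(line) + "\n")
```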