Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2008/09/05 21:38:16 UTC

[Hadoop Wiki] Update of "Hive/GettingStarted" by RaghothamMurthy

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/GettingStarted

------------------------------------------------------------------------------
  our testing has been done against Hadoop 0.17, so we advise running it against 
  this version of hadoop, even though it may compile and work against other versions
  
- 
- '''Downloading and building'''
+ == Downloading and building ==
- ------------------------
  
  You can either build hive jar files for your environment:
- - apply patch
+   * apply patch
- - cd hadoop/src/contrib/hive
+   * cd hadoop/src/contrib/hive
- - src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package
+   * src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package
  
  Or you can use the pre-packaged jars that are available at the following location 
  and untar the archive: 
  http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
  
- $ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
+   * $ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
- $ tar zxvf dist.tar.gz
+   * $ tar zxvf dist.tar.gz
- $ cd dist
+   * $ cd dist
- $ ls
+   * $ ls
- README  bin conf  lib 
+     o README  bin conf  lib 
  
- bin/ (all the shell scripts)
+   * bin/ (all the shell scripts)
- lib/ (required jar files)
+   * lib/ (required jar files)
- conf/ (configuration files)
+   * conf/ (configuration files)
  
  In the rest of the README, we use dist and <install-dir> interchangeably.
  
+ == Running Hive ==
- '''Running Hive'''
- ------------
  
  Hive uses hadoop, which means:
- - you must have hadoop in your path OR
+   * you must have hadoop in your path OR
- - export HADOOP=<hadoop-install-dir>/bin/hadoop
+   * export HADOOP=<hadoop-install-dir>/bin/hadoop
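  For example, either of the following shell commands (with <hadoop-install-dir> standing 
  in for your actual hadoop location) satisfies this requirement:

  $ export PATH=<hadoop-install-dir>/bin:$PATH
  $ export HADOOP=<hadoop-install-dir>/bin/hadoop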
  
  To use the hive command line interface (cli) from the shell:
- $ bin/hive
+ ''$ bin/hive''
  
+ == Using Hive ==
- '''Using Hive'''
- ----------
- 
- '''Configuration management overview'''
+ === Configuration management overview ===
- ---------------------------------
  
  - hive configuration is stored in <install-dir>/conf/hive-default.xml 
    and the log4j configuration in hive-log4j.properties
@@ -63, +57 @@

      this sets the variables x1 and x2 to y1 and y2
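  A sketch only (it is an assumption here that the cli accepts -hiveconf key=value pairs 
  on the command line):

  $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2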
  
  
+ === Error Logs ===
- '''Error Logs'''
- ----------
  Hive uses log4j for logging. By default logs are not emitted to the 
  console by the cli. They are stored in the file:
  - /tmp/{user.name}/hive.log
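  To watch the log while a query runs, something like the following works (assuming your 
  shell user name matches the java user.name property):

  $ tail -f /tmp/$USER/hive.log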
@@ -80, +73 @@

  any bugs (of which there are many!) to athusoo@facebook.com.
  
  
+ == DDL Operations ==
- '''DDL Operations'''
- --------------
  
  Creating Hive tables and browsing through them
  
@@ -120, +112 @@

  Altering tables. Table names can be changed and additional columns can be added
  
  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
+ 
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
+ 
  hive> ALTER TABLE events RENAME TO 3koobecaf;
  
  Dropping tables
  hive> DROP TABLE pokes;
  
  
+ == Metadata Store ==
- '''Metadata Store'''
- --------------
  
  Metadata is in an embedded Derby database whose location is determined by the 
  hive configuration variable named javax.jdo.option.ConnectionURL. By default 
@@ -147, +140 @@

  In the future, the metastore itself can be a standalone server.
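  To point the metastore at a different location in the meantime, this variable can be 
  overridden in <install-dir>/conf/hive-default.xml; a sketch (the derby URL shown is 
  only an example value):

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/path/to/my/metastore_db;create=true</value>
  </property>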
  
  
+ == DML Operations ==
- '''DML Operations'''
- --------------
  
  Loading data from flat files into Hive
  
@@ -162, +154 @@

  If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
  
  NOTES:
- - NO verification of data against the schema
+   * NO verification of data against the schema
- - if the file is in hdfs it is moved into hive controlled file system namespace. 
+   * if the file is in hdfs, it is moved into the hive-controlled file system namespace 
    (a sketch of such a load follows after these notes). The root of the hive directory 
    is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We 
    advise creating this directory before trying to create tables via Hive.
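  A sketch of a load from hdfs rather than the local file system (the input path here is 
  hypothetical); note that the LOCAL keyword is omitted:

  hive> LOAD DATA INPATH '/user/myname/kv1.txt' OVERWRITE INTO TABLE pokes;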
  
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
+ 
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
  
  The two LOAD statements above load data into two different partitions of the table
  invites. Table invites must be created as partitioned by the key ds for this to succeed.
  
- '''Loading/Extracting data using Queries'''
+ == Loading/Extracting data using Queries ==
- -------------------------------------
- 
- '''Runtime configuration'''
+ === Runtime configuration ===
- ---------------------
  
- - Hives queries are executed using map-reduce queries and as such the behavior 
+   * Hive queries are executed as map-reduce jobs, and as such their behavior 
    can be controlled by the hadoop configuration variables
  
- - The cli can be used to set any hadoop (or hive) configuration variable. For example:
+   * The cli can be used to set any hadoop (or hive) configuration variable. For example:
-    o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
+     o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
+ 
-    o hive> SET - v 
+     o hive> SET -v 
   The latter shows all the current settings. Without the -v option, only the 
   variables that differ from the base hadoop configuration are displayed.
- - In particular the number of reducers should be set to a reasonable number 
+   * In particular, the number of reducers should be set to a reasonable number 
    to get good performance (the default is 1!); see the example below.
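  A sketch of such a setting from the cli (32 is an arbitrary choice, not a recommendation):

  hive> SET mapred.reduce.tasks=32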
  
  
+ === Example Queries ===
- '''Example Queries'''
- ---------------
  
  Some example queries are shown below. They are available in examples/queries.
  More are available in the hive contrib sources src/test/queries/positive
  
- SELECTS and FILTERS
+ ==== SELECTS and FILTERS ====
- -------------------
- 
  hive> SELECT a.foo FROM invites a;
  
  select column 'foo' from all rows of invites table. The results are not
@@ -221, +209 @@

  Select all rows from pokes table into a local directory
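  A sketch of such a statement (the output directory name is hypothetical):

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pokes_out' SELECT a.* FROM pokes a;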
  
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
+ 
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
+ 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
+ 
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
+ 
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
+ 
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
+ 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
  
  Sum of a column. avg, min, max can also be used
@@ -233, +227 @@

  NOTE: there are some flaws with the type system that cause doubles to be 
  returned where integer types would be expected. We expect to fix these in the coming week.
  
+ ==== GROUP BY ====
- GROUP BY
- --------
  
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
+ 
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
  
  NOTE: Currently Hive always uses a two-stage map-reduce for the group-by operation. This is 
  to handle skews in the input data. We will be optimizing this in the coming weeks.
  
+ ==== JOIN ====
- JOIN
- ----
  
  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo
  
- MULTITABLE INSERT
+ ==== MULTITABLE INSERT ====
- -----------------
+ 
- FROM src
+   FROM src
- INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
+   INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
- INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
+   INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
- INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
+   INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
- INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300
+   INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300
  
- STREAMING
- ---------
+ ==== STREAMING ====
+ 
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
  
  This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
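  The same TRANSFORM syntax accepts any executable available on the task nodes in place of 
  /bin/cat; a sketch using sed (assuming it is installed at /bin/sed on your nodes):

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/sed -e s/0/zero/' WHERE a.ds > '2008-08-09';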