Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/08 23:45:16 UTC

[Hadoop Wiki] Trivial Update of "Hive/HBaseBulkLoad" by CarlSteinbach

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by CarlSteinbach.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=17&rev2=18

--------------------------------------------------

+ = Hive HBase Bulk Load =
+ 
+ <<TableOfContents>>
+ 
  This page explains how to use Hive to bulk load data into a new (empty) HBase table per [[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].  (If you're not yet using a build that contains this functionality, you'll need to build from source and make sure that both this patch and HIVE-1321 are applied.)
  
- = Overview =
+ == Overview ==
  
  Ideally, bulk load from Hive into HBase would be part of [[Hive/HBaseIntegration]], making it as simple as this:
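
  (A sketch of that ideal one-step flow; this is aspirational rather than a working feature, {{{hive.hbase.bulk}}} is a hypothetical property, and the table and column names are illustrative.)
  
  {{{
  CREATE TABLE new_hbase_table(rowkey STRING, x INT, y INT)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:x,cf1:y");
  
  -- hypothetical switch telling Hive to write HFiles for bulk load
  SET hive.hbase.bulk=true;
  
  INSERT OVERWRITE TABLE new_hbase_table
  SELECT custom_key_expression, x, y
  FROM source_hive_table;
  }}}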
  
@@ -32, +36 @@

  
  The rest of this page explains each step in greater detail.
  
- = Decide on Target HBase Schema =
+ == Decide on Target HBase Schema ==
  
  Currently there are a number of constraints here:
  
@@ -42, +46 @@

  
  Besides dealing with these constraints, probably the most important work here is deciding how you want to assign an HBase row key to each row coming from Hive.  To avoid inconsistencies between lexical and binary comparators, it is simplest to design a string row key and use it consistently all the way through.  If you want to combine multiple columns into the key, use Hive's string concat expression for this purpose.  You can use CREATE VIEW to tack on your row key logically, without having to update any existing data in Hive; see the sketch below.
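
  For example, a view along these lines (the {{{transactions}}} table and its columns are illustrative) derives a string row key from two columns without rewriting any data:
  
  {{{
  CREATE VIEW transactions_with_key AS
  SELECT concat(customer_id, '_', transaction_id) AS rowkey, t.*
  FROM transactions t;
  }}}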
  
- = Estimate Resources Needed =
+ == Estimate Resources Needed ==
  
  TBD:  provide some example numbers based on Facebook experiments; also reference [[http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf|Hadoop Terasort]]
  
- = Prepare Range Partitioning =
+ == Prepare Range Partitioning ==
  
  In order to perform a parallel sort on the data, we need to range-partition it.  The idea is to divide the row key space into nearly equal-sized ranges, one per reducer.  The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges.  As a highly contrived example, suppose your row keys are sequence-generated transaction ID strings (possibly with gaps), you have a year's worth of data starting from January, your data growth is constant month-over-month, and you want to run 12 reducers.  In that case, you could use a query along the lines of the sketch below:
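
  A sketch of one way to compute the 11 split points needed for 12 reducers (everything here is illustrative: it assumes the {{{transactions}}} table from this example, IDs that cast cleanly to integers, and a Hive version whose {{{percentile}}} UDAF accepts an array of percentile values):
  
  {{{
  SELECT percentile(CAST(transaction_id AS BIGINT),
                    array(1.0/12, 2.0/12, 3.0/12, 4.0/12,  5.0/12,  6.0/12,
                          7.0/12, 8.0/12, 9.0/12, 10.0/12, 11.0/12))
  FROM transactions;
  }}}
  
  The resulting numeric split points then need to be formatted back into the same string representation as the row keys before being written into the range key file prepared below.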
  
@@ -95, +99 @@

  dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
  }}}
  
- = Prepare Staging Location =
+ == Prepare Staging Location ==
  
  The sort is going to produce a lot of data, so make sure you have sufficient space in your HDFS cluster, and choose the location where the files will be staged.  We'll use {{{/tmp/hbsort}}} in this example.
  
@@ -106, +110 @@

  dfs -mkdir /tmp/hbsort;
  }}}
  
- = Sort Data =
+ == Sort Data ==
  
  Now comes the big step:  running a sort over all of the data to be bulk loaded.  Make sure that your Hive instance has the HBase jars available on its auxpath.
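
  For example, the relevant jars can be passed on the command line when starting Hive (jar names and paths are illustrative and depend on your installation):
  
  {{{
  hive --auxpath /path/to/hive-hbase-handler.jar,/path/to/hbase.jar,/path/to/zookeeper.jar
  }}}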
  
@@ -138, +142 @@

  
  The first column in the SELECT list is interpreted as the rowkey; subsequent columns become cell values (all in a single column family, so their column names are important).
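
  A sketch of the shape of such a statement (the {{{hbsort}}} staging table and the column names are illustrative; the partitioner property names assume the Hadoop 0.20-era {{{TotalOrderPartitioner}}}, keyed off the range key file prepared earlier):
  
  {{{
  SET mapred.reduce.tasks=12;
  SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
  SET total.order.partitioner.path=/tmp/hb_range_key_list;
  
  INSERT OVERWRITE TABLE hbsort
  SELECT transaction_id, description, amount
  FROM transactions
  CLUSTER BY transaction_id;
  }}}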
  
- = Run HBase Script =
+ == Run HBase Script ==
  
  Once the sort job completes successfully, one final step is required for importing the result files into HBase.
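
  On HBase 0.20.x-era releases this is done with the {{{loadtable.rb}}} script that ships with HBase; a sketch of the invocation (the table name and staging path follow this page's example, and {{{cf}}} stands in for your column family):
  
  {{{
  hbase org.jruby.Main bin/loadtable.rb transactions /tmp/hbsort/cf
  }}}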
  
@@ -154, +158 @@

  
  After this script finishes, you may need to wait a minute or two for the new table to be picked up by the HBase meta scanner.  Use the hbase shell to verify that the new table was created correctly, and do some sanity queries to locate individual cells and make sure they can be found.
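
  For example (the table name follows this page's example; the row key shown is illustrative):
  
  {{{
  hbase shell
  hbase(main):001:0> describe 'transactions'
  hbase(main):002:0> count 'transactions'
  hbase(main):003:0> get 'transactions', '0000000001'
  }}}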
  
- = Map New Table Back Into Hive =
+ == Map New Table Back Into Hive ==
  
  Finally, if you'd like to access the HBase table you just created via Hive:
  
@@ -165, +169 @@

  TBLPROPERTIES("hbase.table.name" = "transactions");
  }}}
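
  Once mapped, the table can be queried like any other Hive table; for instance (assuming the Hive table in the statement above was named {{{hbase_transactions}}}):
  
  {{{
  SELECT COUNT(*) FROM hbase_transactions;
  }}}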
  
- = Followups Needed =
+ == Followups Needed ==
  
   * Support sparse tables
   * Support loading binary data representations once HIVE-1245 is fixed