Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/08/16 23:42:12 UTC

[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=18&rev2=19

--------------------------------------------------

  
  == Prepare Range Partitioning ==
  
  In order to perform a parallel sort on the data, we need to range-partition it.  The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer used in the parallel sort.  The details will vary according to your source data, and you may need to run a number of exploratory Hive queries to come up with a good enough set of ranges.  Here's one example:
  
  {{{
  set mapred.reduce.tasks=1;
  create temporary function row_sequence as
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
  select transaction_id from
  (select transaction_id
  from transactions
  tablesample(bucket 1 out of 10000 on transaction_id) s
  order by transaction_id
  limit 10000000) x
  where (row_sequence() % 910000) = 0
  limit 11;
  }}}
  
  This works by ordering all of the rows in a sample of the table (using a single reducer), and then selecting every nth row (here, n=910000).  The value of n is chosen by dividing the number of rows in the sample by the desired number of ranges, 12 in this case (one more than the number of partitioning keys produced by the LIMIT clause).  The assumption is that the distribution of keys in the sample matches the distribution in the table as a whole; if it does not, the resulting partition keys will lead to skew in the parallel sort.
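
  For instance, here is a minimal, hypothetical helper (assuming the same transactions table and sampling clause as above) for picking n: count the rows the sample will contain, then divide by the desired number of ranges.

  {{{
  -- Hypothetical sizing query: count the sampled rows, then choose
  -- n ~= (sampled row count) / (number of ranges).
  select count(*)
  from transactions
  tablesample(bucket 1 out of 10000 on transaction_id) s;
  -- e.g. 10,920,000 sampled rows / 12 ranges => n = 910,000
  }}}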
  
  Once you have your sampling query defined, the next step is to save its results to a properly formatted file which will be used in a subsequent step.  To do this, run commands like the following:
  
  {{{
  [...]
  location '/tmp/hb_range_keys';
  
  insert overwrite table hb_range_keys
  select transaction_id from
  (select transaction_id
  from transactions
  tablesample(bucket 1 out of 10000 on transaction_id) s
  order by transaction_id
  limit 10000000) x
  where (row_sequence() % 910000) = 0
  limit 11;
  }}}
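
  As a quick sanity check, you can confirm from the Hive CLI that the range keys were written; with a single reducer there should be one small file under the target location:

  {{{
  -- Run from the Hive CLI: list the contents of the directory
  -- backing the hb_range_keys table.
  dfs -ls /tmp/hb_range_keys;
  }}}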