Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/02/04 21:49:14 UTC
[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=21&rev2=22
--------------------------------------------------
To perform a parallel sort on the data, we need to range-partition it. The idea is to divide the row key space into nearly equal-sized ranges, one for each reducer used in the parallel sort. The details will vary according to your source data, and you may need to run several exploratory Hive queries to come up with a good enough set of ranges. Here's one example:
{{{
+ add jar lib/hive_contrib.jar;
set mapred.reduce.tasks=1;
create temporary function row_sequence as
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';