Posted to user@hbase.apache.org by Renaud Delbru <re...@deri.org> on 2010/09/23 13:16:17 UTC

Creating a Table using HFileOutputFormat

  Hi,

we are trying to create an HBase table from scratch using MapReduce and
HFileOutputFormat. However, we haven't really found examples or tutorials
on how to do this, and some aspects are still unclear to us. We are using
HBase 0.20.x.

First, what is the correct way to use HFileOutputFormat to create
HFiles?
We are simply using a map function which outputs <ImmutableBytesWritable
(key), Put (value)> pairs, an identity reducer, and we configure the job
to use HFileOutputFormat as the output format class.
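In outline, our driver looks roughly like this (a simplified sketch;
TableCreationJob and OurMapper are stand-ins for our actual classes):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TableCreationJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "create-table-hfiles");
        job.setJarByClass(TableCreationJob.class);

        // OurMapper emits <ImmutableBytesWritable (row key), Put> pairs
        job.setMapperClass(OurMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        // no setReducerClass() call: the default (identity) Reducer is used

        job.setOutputFormatClass(HFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }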
However, we have seen that HBase 0.89.x uses a more complex approach,
involving a sorting reducer (KeyValueSortReducer or PutSortReducer) and a
partitioner (TotalOrderPartitioner). HFileOutputFormat provides a
convenience method, configureIncrementalLoad, to automatically configure
the Hadoop job. Is this method needed in our case? Or is it only
necessary when the table already exists (incremental bulk load)?
Do we have to reimplement this for 0.20.x?
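For reference, the 0.89.x pattern looks roughly like the following sketch
(based on the 0.89 docs; "mytable" and OurMapper are placeholders, and
conf is an HBaseConfiguration as above):

    // configureIncrementalLoad inspects the existing table's region
    // boundaries, configures TotalOrderPartitioner to match them, and
    // picks PutSortReducer or KeyValueSortReducer based on the map
    // output value class.
    HTable table = new HTable(conf, "mytable");
    Job job = new Job(conf, "prepare-hfiles");
    job.setMapperClass(OurMapper.class);  // emits <ImmutableBytesWritable, Put>
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    HFileOutputFormat.configureIncrementalLoad(job, table);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));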

Then, once the table creation job has completed successfully, how do we
import the HFiles into HBase? Do we use the HBase CLI import command?

Thanks in advance for your answers,
Regards
-- 
Renaud Delbru

Re: Creating a Table using HFileOutputFormat

Posted by Renaud Delbru <re...@deri.org>.
  On 24/09/10 16:55, Ted Yu wrote:
>  From TotalOrderPartitioner:
>        K[] splitPoints = readPartitions(fs, partFile, keyClass, conf);
>        if (splitPoints.length != job.getNumReduceTasks() - 1) {
> Partition list can be empty if you use 1 reducer.
>
> But this is not what you want, I guess.
Yes, this is not what we want, since we want to create x regions.
But we just found that there is a tool, InputSampler, in the Hadoop
library for this task. It samples an arbitrary dataset and creates the
partition splits. We will try this approach first. My guess is that, even
if these partitions are an approximation, it should be OK for HBase. The
region sizes will not be totally identical, but that should not be a
problem, since the larger regions will be the first ones HBase splits
into smaller regions. Can somebody confirm this assumption?
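Concretely, we plan to try something like the following sketch,
continuing from the job setup described in my first message (the sampling
parameters are arbitrary, and the new-API InputSampler may need the same
porting treatment as the TotalOrderPartitioner):

    // Sketch: sample the job's input keys and write numReduceTasks - 1
    // split points to the partition file read by TotalOrderPartitioner
    // (here, the new-API port discussed in this thread).
    // Caveat: RandomSampler draws keys from the InputFormat, so this only
    // works if the input key type matches the map output key type.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    job.setNumReduceTasks(10);  // one reducer, hence one region, per partition
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/partitions.lst"));
    InputSampler.Sampler<ImmutableBytesWritable, Put> sampler =
        new InputSampler.RandomSampler<ImmutableBytesWritable, Put>(
            0.1, 10000, 100);  // freq, numSamples, maxSplitsSampled
    InputSampler.writePartitionFile(job, sampler);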
-- 
Renaud Delbru

Re: Creating a Table using HFileOutputFormat

Posted by Ted Yu <yu...@gmail.com>.
From TotalOrderPartitioner:
      K[] splitPoints = readPartitions(fs, partFile, keyClass, conf);
      if (splitPoints.length != job.getNumReduceTasks() - 1) {
Partition list can be empty if you use 1 reducer.

But this is not what you want, I guess.

On Fri, Sep 24, 2010 at 4:54 AM, Renaud Delbru <re...@deri.org> wrote:

>  Hi Stack,
>
>
> On 23/09/10 19:25, Renaud Delbru wrote:
>
>>  On 23/09/10 19:22, Stack wrote:
>>
>>>> Will the TotalOrderPartitioner found in the Hadoop library not work for
>>>> 0.20.x?
>>>>
>>> You might have to do what Todd did in TRUNK where he brought over the
>>> 'mapred' TotalOrderPartitioner to go against the new 'mapreduce' API
>>> (The bulk load is done against the hadoop 'new' API 'mapreduce' as
>>> opposed to 'mapred' package).  You might even be able to just copy
>>> what Todd has done in trunk over to your 0.20 install?
>>>
>> Yes, it is what we did, and it seems to work.
>>
> The job failed because the TotalOrderPartitioner requires a
> partitions.lst file, which should contain the list of start keys for each
> region. However, in our case, since we are building the table from scratch,
> we don't know the start keys of each partition. Is there a way to bypass
> this, or do we first need to run a scan over our data collection to create
> this partition list?
> --
> Renaud Delbru
>
>

Re: Creating a Table using HFileOutputFormat

Posted by Renaud Delbru <re...@deri.org>.
  Hi Stack,

On 23/09/10 19:25, Renaud Delbru wrote:
>  On 23/09/10 19:22, Stack wrote:
>>> Will the TotalOrderPartitioner found in the Hadoop library not work for
>>> 0.20.x?
>> You might have to do what Todd did in TRUNK where he brought over the
>> 'mapred' TotalOrderPartitioner to go against the new 'mapreduce' API
>> (The bulk load is done against the hadoop 'new' API 'mapreduce' as
>> opposed to 'mapred' package).  You might even be able to just copy
>> what Todd has done in trunk over to your 0.20 install?
> Yes, it is what we did, and it seems to work.
The job failed because the TotalOrderPartitioner requires a
partitions.lst file, which should contain the list of start keys for
each region. However, in our case, since we are building the table from
scratch, we don't know the start keys of each partition. Is there a way
to bypass this, or do we first need to run a scan over our data
collection to create this partition list?
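My understanding is that the partition file is just a SequenceFile of
sorted keys with null values, so if we choose the split keys ourselves,
writing it by hand should look roughly like this (a sketch; the keys and
path are made up, and conf is our job's Configuration):

    // Write numReduceTasks - 1 sorted split keys as a
    // SequenceFile<ImmutableBytesWritable, NullWritable>, the format
    // TotalOrderPartitioner's readPartitions() expects.
    Path partFile = new Path("/tmp/partitions.lst");
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, partFile, ImmutableBytesWritable.class, NullWritable.class);
    try {
      for (String key : new String[] { "g", "n", "t" }) {  // made-up keys
        writer.append(new ImmutableBytesWritable(Bytes.toBytes(key)),
            NullWritable.get());
      }
    } finally {
      writer.close();
    }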
-- 
Renaud Delbru


Re: Creating a Table using HFileOutputFormat

Posted by Renaud Delbru <re...@deri.org>.
  On 23/09/10 19:22, Stack wrote:
>> Will the TotalOrderPartitioner found in the Hadoop library not work for
>> 0.20.x?
> You might have to do what Todd did in TRUNK where he brought over the
> 'mapred' TotalOrderPartitioner to go against the new 'mapreduce' API
> (The bulk load is done against the hadoop 'new' API 'mapreduce' as
> opposed to 'mapred' package).  You might even be able to just copy
> what Todd has done in trunk over to your 0.20 install?
Yes, it is what we did, and it seems to work.
Thanks.
-- 
Renaud Delbru

Re: Creating a Table using HFileOutputFormat

Posted by Stack <st...@duboce.net>.
On Thu, Sep 23, 2010 at 9:50 AM, Renaud Delbru <re...@deri.org> wrote:
>
> Will the TotalOrderPartitioner found in the Hadoop library not work for
> 0.20.x?
>

You might have to do what Todd did in TRUNK where he brought over the
'mapred' TotalOrderPartitioner to go against the new 'mapreduce' API
(The bulk load is done against the hadoop 'new' API 'mapreduce' as
opposed to 'mapred' package).  You might even be able to just copy
what Todd has done in trunk over to your 0.20 install?

St.Ack

Re: Creating a Table using HFileOutputFormat

Posted by Renaud Delbru <re...@deri.org>.
  Hi Stack,

On 23/09/10 17:13, Stack wrote:
> You've seen this documentation for bulk import in 0.20.x:
> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk?
>   (Make sure you are on 0.20.6).
No, I missed that one. Thanks for pointing it out.
> In TRUNK bulk import was revamped.  It's all fancy and robust now.  See
> http://hbase.apache.org/docs/r0.89.20100726/bulk-loads.html
Yes, I saw that one, but we are using the 0.20.x version.
> In both versions a partitioner is required.  In TRUNK the hadoop total
> order partitioner is brought local and should work for most key types.
>   In 0.20.x you'd need to write your own.
Will the TotalOrderPartitioner found in the Hadoop library not work for
0.20.x?
> In 0.20.x, there is no support for incremental loading.  It will only
> load a fresh table.  Incremental is a feature of the TRUNK version.
OK.
> In 0.20.x, you use the loadtable.rb script.  In TRUNK, you run a
> little Java program.
OK, thanks.
Everything is much clearer now.

Best,
-- 
Renaud Delbru

Re: Creating a Table using HFileOutputFormat

Posted by Stack <st...@duboce.net>.
Hello Renaud:

You've seen this documentation for bulk import in 0.20.x:
http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk?
 (Make sure you are on 0.20.6).

In TRUNK bulk import was revamped.  It's all fancy and robust now.  See
http://hbase.apache.org/docs/r0.89.20100726/bulk-loads.html

In both versions a partitioner is required.  In TRUNK the hadoop total
order partitioner is brought local and should work for most key types.
 In 0.20.x you'd need to write your own.

In 0.20.x, there is no support for incremental loading.  It will only
load a fresh table.  Incremental is a feature of the TRUNK version.

In 0.20.x, you use the loadtable.rb script.  In TRUNK, you run a
little Java program.
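If I remember right, the loadtable.rb invocation is along the lines of

    ${HBASE_HOME}/bin/hbase org.jruby.Main bin/loadtable.rb TABLENAME HFILE_OUTPUT_DIR

but check the bulk import section of the docs linked above for the exact
usage.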

St.Ack


On Thu, Sep 23, 2010 at 4:16 AM, Renaud Delbru <re...@deri.org> wrote:
>  Hi,
>
> we are trying to create an HBase table from scratch using MapReduce and
> HFileOutputFormat. However, we haven't really found examples or tutorials on
> how to do this, and some aspects are still unclear to us. We are using
> HBase 0.20.x.
>
> First, what is the correct way to use HFileOutputFormat to create
> HFiles?
> We are simply using a map function which outputs <ImmutableBytesWritable
> (key), Put (value)> pairs, an identity reducer, and we configure the job to
> use HFileOutputFormat as the output format class.
> However, we have seen that HBase 0.89.x uses a more complex approach,
> involving a sorting reducer (KeyValueSortReducer or PutSortReducer) and a
> partitioner (TotalOrderPartitioner). HFileOutputFormat provides a convenience
> method, configureIncrementalLoad, to automatically configure the Hadoop job.
> Is this method needed in our case? Or is it only necessary in the case
> where the table already exists (incremental bulk load)?
> Do we have to reimplement this for 0.20.x?
>
> Then, once the table creation job has completed successfully, how do we
> import the HFiles into HBase? Do we use the HBase CLI import command?
>
> Thanks in advance for your answers,
> Regards
> --
> Renaud Delbru
>