Posted to user@hbase.apache.org by Nanheng Wu <na...@gmail.com> on 2011/01/06 00:54:54 UTC

Bulk load using HFileOutputFormat.RecordWriter

Hi,

  I am new to HBase and Hadoop and I am trying to find the best way to
bulk load a table from HDFS into HBase. I don't mind creating a new
table for each batch, and from what I understand, using HFileOutputFormat
directly in an MR job is the most efficient method. My input data set
is already in sorted order, so it seems to me that I don't need
reducers, which would force a global sort of already-sorted data.
I tried to use HFileOutputFormat.getRecordWriter in my mapper with 0
reducers, but the output directory has only a _temporary directory
with my outputs in each subdirectory. That doesn't seem to be what the
loadtable script expects (a column family directory with HFiles). Can
someone tell me if what I am doing makes sense in general, or how to do
this properly? Thanks!

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Stack <st...@duboce.net>.
On Fri, Jan 7, 2011 at 9:31 AM, Nanheng Wu <na...@gmail.com> wrote:
> Also, do you see a problem with dropping tables while serving queries
> for other tables?

You mean doing disable then drop in the shell?  That functionality is
kinda flakey in 0.20; it does not work reliably.  The disable action runs
through all regions and asks all the regionservers to close out the
regions of the table.  You can run the disable while serving queries,
but the close-up of the old regions will put a load on the system --
perhaps dragging down read latency -- as regions are usually
flushed and compacted before the close can complete (we need a
facility for saying just close -- no flush, no compact -- for the case of
a table we know we don't want to keep).  The drop then deletes the entries
from .META. and the content in HDFS.
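
For reference, the shell sequence in question is just the following
(the table name is a placeholder):

  hbase> disable 'old_batch_table'
  hbase> drop 'old_batch_table'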

You might be better off writing a few scripts of your own that do a
slow-motion removal of the old table.  They'd pick a region off .META.,
disable that individual region, check the close had happened, then remove
it from .META. and HDFS.  You'd delete through the table slowly to ensure
serving was not affected.
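
A hedged sketch of that per-region loop, using shell and HDFS commands
of that era (the region name, .META. row key, and HDFS path below are
made-up placeholders; check the exact close_region arguments on your
version before relying on this):

  # close one region of the old table
  hbase> close_region 'old_batch_table,,1294357200000'
  # once it has closed, remove its row from .META.
  hbase> deleteall '.META.', 'old_batch_table,,1294357200000'
  # then remove that region's files from HDFS
  $ hadoop fs -rmr /hbase/old_batch_table/1028785192

Repeat region by region, pausing between iterations, until the table
is gone.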

St.Ack


> We are using HBase 0.20 right now, and with this bulk
> load method we are getting great performance, but it does require using
> a new table for each load. We want to clean up older data by dropping
> their tables, either when a new table is loaded or via a cron job. What
> kind of impact would this approach have on reads? Thanks!
>

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Nanheng Wu <na...@gmail.com>.
Also, do you see a problem with dropping tables while serving queries
for other tables? We are using HBase 0.20 right now, and with this bulk
load method we are getting great performance, but it does require using
a new table for each load. We want to clean up older data by dropping
their tables, either when a new table is loaded or via a cron job. What
kind of impact would this approach have on reads? Thanks!

On Thursday, January 6, 2011, Stack <st...@duboce.net> wrote:
> There is no such thing really in 0.20.6.  I suppose you could do a
> pre-scan of the table entries up in .META., and if you couldn't find
> the region with the empty end key, then you'd know the table wasn't
> online.  There is nothing in 0.90.0 either, but in 0.90.0 there is a
> notion of 'is_enabling' and we could set this flag up in zk while the
> table is coming online.  Your client could poll for is_enabled.
> St.Ack
>
> On Thu, Jan 6, 2011 at 3:12 PM, Nanheng Wu <na...@gmail.com> wrote:
>> Yes, it's only seconds. Just for several seconds I can see the table in
>> the HBase UI, but when I clicked through it I got an error saying no
>> entries were found in the .META. table. I guess it's not too bad since
>> it's only a few seconds, but a mechanism to know for sure when all the
>> entries are loaded in .META. would be very helpful.
>>
>> On Thu, Jan 6, 2011 at 2:42 PM, Stack <st...@duboce.net> wrote:
>>> On Thu, Jan 6, 2011 at 10:17 AM, Nanheng Wu <na...@gmail.com> wrote:
>>>> Thanks for the answer Todd. I realized that I was making my life
>>>> harder by using the low-level record writer directly. Instead I just
>>>> made the mapper output an <ImmutableBytesWritable, KeyValue> pair and
>>>> set the output format to HFileOutputFormat. It works really great! I
>>>> have a follow-up question: after I run the loadtable.rb script it
>>>> takes a little while before the table is actually ready to be queried.
>>>> Is there a way to programmatically test if the table is "ready"? I am
>>>> using hbase-0.20.6. Thanks!
>>>>
>>>
>>> What is taking the time?  Is it that there are a bunch of regions
>>> and they don't come online atomically, but rather one at a time?  When
>>> you say 'a little while' in the above, you are talking about seconds,
>>> right?  (IIRC, all loadtable is doing is adding entries to .META. --
>>> though it may also be moving files into place).
>>>
>>> St.Ack
>>>
>>
>

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Stack <st...@duboce.net>.
There is no such thing really in 0.20.6.  I suppose you could do a
pre-scan of the table entries up in .META., and if you couldn't find
the region with the empty end key, then you'd know the table wasn't
online.  There is nothing in 0.90.0 either, but in 0.90.0 there is a
notion of 'is_enabling' and we could set this flag up in zk while the
table is coming online.  Your client could poll for is_enabled.
St.Ack
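
A rough sketch of that pre-scan against the 0.20-era client API (the
class and helper name here are illustrative assumptions; verify the
catalog column names and the Result.getValue signature on your version):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class TableOnlineCheck {
  // Returns true once .META. holds a region of this table whose end key
  // is empty, i.e. the last region's entry has been inserted.
  static boolean lastRegionInMeta(HBaseConfiguration conf, byte[] tableName)
      throws Exception {
    HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
    // Region rows are keyed "tablename,startkey,timestamp", so starting
    // the scan at the bare table name lands on the table's first region.
    ResultScanner scanner = meta.getScanner(new Scan(tableName));
    try {
      for (Result r : scanner) {
        byte[] bytes =
            r.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
        if (bytes == null) continue;
        HRegionInfo info = Writables.getHRegionInfo(bytes);
        if (!Bytes.equals(info.getTableDesc().getName(), tableName)) {
          break;  // scanned past this table's rows
        }
        if (info.getEndKey().length == 0) {
          return true;  // region with the empty end key is present
        }
      }
      return false;  // not fully loaded yet; caller should poll and retry
    } finally {
      scanner.close();
    }
  }
}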

On Thu, Jan 6, 2011 at 3:12 PM, Nanheng Wu <na...@gmail.com> wrote:
> Yes, it's only seconds. Just for several seconds I can see the table in
> the HBase UI, but when I clicked through it I got an error saying no
> entries were found in the .META. table. I guess it's not too bad since
> it's only a few seconds, but a mechanism to know for sure when all the
> entries are loaded in .META. would be very helpful.
>
> On Thu, Jan 6, 2011 at 2:42 PM, Stack <st...@duboce.net> wrote:
>> On Thu, Jan 6, 2011 at 10:17 AM, Nanheng Wu <na...@gmail.com> wrote:
>>> Thanks for the answer Todd. I realized that I was making my life
>>> harder by using the low-level record writer directly. Instead I just
>>> made the mapper output an <ImmutableBytesWritable, KeyValue> pair and
>>> set the output format to HFileOutputFormat. It works really great! I
>>> have a follow-up question: after I run the loadtable.rb script it
>>> takes a little while before the table is actually ready to be queried.
>>> Is there a way to programmatically test if the table is "ready"? I am
>>> using hbase-0.20.6. Thanks!
>>>
>>
>> What is taking the time?  Is it that there are a bunch of regions
>> and they don't come online atomically, but rather one at a time?  When
>> you say 'a little while' in the above, you are talking about seconds,
>> right?  (IIRC, all loadtable is doing is adding entries to .META. --
>> though it may also be moving files into place).
>>
>> St.Ack
>>
>

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Nanheng Wu <na...@gmail.com>.
Yes, it's only seconds. Just for several seconds I can see the table in
the HBase UI, but when I clicked through it I got an error saying no
entries were found in the .META. table. I guess it's not too bad since
it's only a few seconds, but a mechanism to know for sure when all the
entries are loaded in .META. would be very helpful.

On Thu, Jan 6, 2011 at 2:42 PM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 6, 2011 at 10:17 AM, Nanheng Wu <na...@gmail.com> wrote:
>> Thanks for the answer Todd. I realized that I was making my life
>> harder by using the low-level record writer directly. Instead I just
>> made the mapper output an <ImmutableBytesWritable, KeyValue> pair and
>> set the output format to HFileOutputFormat. It works really great! I
>> have a follow-up question: after I run the loadtable.rb script it
>> takes a little while before the table is actually ready to be queried.
>> Is there a way to programmatically test if the table is "ready"? I am
>> using hbase-0.20.6. Thanks!
>>
>
> What is taking the time?  Is it that there are a bunch of regions
> and they don't come online atomically, but rather one at a time?  When
> you say 'a little while' in the above, you are talking about seconds,
> right?  (IIRC, all loadtable is doing is adding entries to .META. --
> though it may also be moving files into place).
>
> St.Ack
>

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Stack <st...@duboce.net>.
On Thu, Jan 6, 2011 at 10:17 AM, Nanheng Wu <na...@gmail.com> wrote:
> Thanks for the answer Todd. I realized that I was making my life
> harder by using the low-level record writer directly. Instead I just
> made the mapper output an <ImmutableBytesWritable, KeyValue> pair and
> set the output format to HFileOutputFormat. It works really great! I
> have a follow-up question: after I run the loadtable.rb script it
> takes a little while before the table is actually ready to be queried.
> Is there a way to programmatically test if the table is "ready"? I am
> using hbase-0.20.6. Thanks!
>

What is taking the time?  Is it that there are a bunch of regions
and they don't come online atomically, but rather one at a time?  When
you say 'a little while' in the above, you are talking about seconds,
right?  (IIRC, all loadtable is doing is adding entries to .META. --
though it may also be moving files into place).

St.Ack

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Nanheng Wu <na...@gmail.com>.
Thanks for the answer Todd. I realized that I was making my life
harder by using the low-level record writer directly. Instead I just
made the mapper output an <ImmutableBytesWritable, KeyValue> pair and
set the output format to HFileOutputFormat. It works really great! I
have a follow-up question: after I run the loadtable.rb script it
takes a little while before the table is actually ready to be queried.
Is there a way to programmatically test if the table is "ready"? I am
using hbase-0.20.6. Thanks!
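
For anyone following along, a minimal sketch of that setup against the
0.20-era mapreduce API; the mapper, column family, qualifier, and paths
below are illustrative assumptions, not taken from this thread:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {
  // Emits <row key, KeyValue>; assumes tab-separated "row\tvalue" input
  // that is already globally sorted by row key.
  static class HFileMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("colfam"),
          Bytes.toBytes("qual"), Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "bulkload");
    job.setJarByClass(BulkLoadJob.class);
    job.setMapperClass(HFileMapper.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setNumReduceTasks(0);               // input is pre-sorted: map-only
    job.setOutputFormatClass(HFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Letting the framework drive HFileOutputFormat this way (instead of
calling getRecordWriter by hand) means the output committer promotes
the files out of _temporary when the job commits, which is the step
that was missing in the original attempt.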

On Wed, Jan 5, 2011 at 6:48 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Hi Nanheng,
>
> It sounds like you're on the right path, but that you're missing the
> "commit" step when using the output format.
>
> The layout of the output dir should look something like:
> output/
> output/colfam/
> output/colfam/234923423
> output/colfam/349593453  <-- these are just unique IDs
>
> Thanks
> -Todd
>
>
>
> On Wed, Jan 5, 2011 at 3:54 PM, Nanheng Wu <na...@gmail.com> wrote:
>
>> Hi,
>>
>>  I am new to HBase and Hadoop and I am trying to find the best way to
>> bulk load a table from HDFS into HBase. I don't mind creating a new
>> table for each batch, and from what I understand, using HFileOutputFormat
>> directly in an MR job is the most efficient method. My input data set
>> is already in sorted order, so it seems to me that I don't need
>> reducers, which would force a global sort of already-sorted data.
>> I tried to use HFileOutputFormat.getRecordWriter in my mapper with 0
>> reducers, but the output directory has only a _temporary directory
>> with my outputs in each subdirectory. That doesn't seem to be what the
>> loadtable script expects (a column family directory with HFiles). Can
>> someone tell me if what I am doing makes sense in general, or how to do
>> this properly? Thanks!
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Bulk load using HFileOutputFormat.RecordWriter

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Nanheng,

It sounds like you're on the right path, but that you're missing the
"commit" step when using the output format.

The layout of the output dir should look something like:
output/
output/colfam/
output/colfam/234923423
output/colfam/349593453  <-- these are just unique IDs
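
Once the directory looks like that, you hand it to the load script; the
0.20-era invocation was along these lines (the table name and output
path are placeholders):

  $ bin/hbase org.jruby.Main bin/loadtable.rb mytable /bulkload-output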

Thanks
-Todd



On Wed, Jan 5, 2011 at 3:54 PM, Nanheng Wu <na...@gmail.com> wrote:

> Hi,
>
>  I am new to HBase and Hadoop and I am trying to find the best way to
> bulk load a table from HDFS into HBase. I don't mind creating a new
> table for each batch, and from what I understand, using HFileOutputFormat
> directly in an MR job is the most efficient method. My input data set
> is already in sorted order, so it seems to me that I don't need
> reducers, which would force a global sort of already-sorted data.
> I tried to use HFileOutputFormat.getRecordWriter in my mapper with 0
> reducers, but the output directory has only a _temporary directory
> with my outputs in each subdirectory. That doesn't seem to be what the
> loadtable script expects (a column family directory with HFiles). Can
> someone tell me if what I am doing makes sense in general, or how to do
> this properly? Thanks!
>



-- 
Todd Lipcon
Software Engineer, Cloudera