Posted to user@hbase.apache.org by Ashish Shinde <as...@strandls.com> on 2011/01/19 05:46:10 UTC

Bulk upload with multiple reducers with hbase-0.90.0

Hi,

I am new to hbase and to hadoop as well so forgive me if the following
is naive.

I am trying to bulk upload large amounts of data (billions of rows with
15-20 columns) into an empty hbase table using two column families.

The approach I tried was to use MR. The code is copied over and
modified from ImportTsv.java.

I did not get good performance because the code used
TotalOrderPartitioner, which, as I gathered, looked at the current
number of regions and decided to use a single reducer on an empty
table.

I then tried SimpleTotalOrderPartitioner with conservatively large
start and end keys, which ended up dividing the data unequally across
our 10-node cluster.
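
For reference, a minimal sketch of this kind of job setup (not the exact
code used here; it assumes the HFileOutputFormat.configureIncrementalLoad()
path that ImportTsv uses, and the mapper, column family, paths and table
name below are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadPrepare {
      // Placeholder mapper: first TSV field is the row key, the rest become columns.
      static class TsvMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static final byte[] CF = Bytes.toBytes("cf1");  // assumed family name
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          byte[] row = Bytes.toBytes(fields[0]);
          Put put = new Put(row);
          for (int i = 1; i < fields.length; i++) {
            put.add(CF, Bytes.toBytes("c" + i), Bytes.toBytes(fields[i]));
          }
          ctx.write(new ImmutableBytesWritable(row), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulk-load-prepare");
        job.setJarByClass(BulkLoadPrepare.class);
        job.setMapperClass(TsvMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/input/tsv"));        // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/output/hfiles"));  // placeholder
        // This wires in TotalOrderPartitioner over the table's current region
        // boundaries and sets #reducers = #regions -- i.e. a single reducer
        // when the table is empty, which is the behaviour described above.
        HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "my_table"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }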

Questions

1. Can bulk upload use TotalOrderPartitioner with multiple reducers?

2. I don't have a handle on the min and max row key of the data
unless I collect them during the map phase. Is it possible to
reconfigure the partitioner after the map phase is over?

3. I would need to frequently load datasets with billions of rows
(450-800 GB) into HBase, as the solution is part of a data processing
pipeline. My (optimistic) estimate on a 10-node cluster is 7 hours. Is
this reasonable? Would HBase scale to, say, 100s of such datasets,
given that I can add disk space and nodes to the cluster?

Thanks,

 - Ashish


Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Marc Limotte <ms...@gmail.com>.
Ashish,

I had similar experiences with our data.  You do have to explicitly turn on
compression for importtsv; it doesn't pick up the config for the family
automatically.  If you don't do that, then you have to wait for a
major_compaction to go through and compress everything.

For importtsv, you can use command-line options like:

>  -Dhfile.compression=gz -Dhfile.compress.output=true
>
That gives GZIP compression; if you have the LZO codec installed, you
should be able to change it to "lzo", though I haven't tried that yet.  The
"hfile.compress.output" setting may not be necessary, as specifying
"hfile.compression" alone might be sufficient.

We also have large, immutable data loads in big batches.  For new tables, it
was important to pre-split the regions.  I wound up running the first job
and then sampling the output keys to get 200 to 300 keys in order.  You only
have to do this once for a new table.  I also found it helpful to run a
major compaction right after a bulk load (since the keys for the initial
partitions were only sampled, this helps smooth things out).
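
A rough sketch of that pre-split step (assuming the 0.90
HBaseAdmin.createTable(desc, splitKeys) overload; the table and family names
are placeholders, and the sampled split keys are passed in already sorted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      // args: the ~200-300 sampled split keys, in sorted order.
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("my_table");  // placeholder name
        HColumnDescriptor fam = new HColumnDescriptor("cf1");      // placeholder family
        fam.setCompressionType(Compression.Algorithm.GZ);          // family-level compression
        desc.addFamily(fam);

        byte[][] splitKeys = new byte[args.length][];
        for (int i = 0; i < args.length; i++) {
          splitKeys[i] = Bytes.toBytes(args[i]);
        }
        admin.createTable(desc, splitKeys);  // one region per split interval
      }
    }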

As a performance data point: for about 12 GB (uncompressed), 39M records, on
a table that had 1,405 regions, our load throughput was around 40k rows /
second.  This is an 11-node cluster (8 core).  With only 300 regions (when
the table was initially empty), we achieved 200k rows / second.  It seems to
me that the loads go faster when there are fewer regions (as long as there
are not too few, causing split-churn); so, as Stack suggested, it might make
sense to increase the hfile size to 1 GB.

Marc


On Fri, Jan 21, 2011 at 10:17 PM, Ashish Shinde <as...@strandls.com> wrote:

> Yes I did try LZO compression it helped, however the resultant disk
> usage was on par with uncompressed text size.
>
> Writing out our serialized data records as batches as a single row with
> a single column and LZO compression enabled resulted in the data
> getting compressed to 25-30% of original size.
>
> The impressive thing was that with the above approach the number of
> rows on our test data reduced from 12 million to 12 K. However the
> insert times were very similar again indicating that hbase inserts
> times are sort of irrespective of the current table size.
>
> On a side note I needed to run major_compaction to get the data
> compressed. Bulk upload did not write out data compressed.
>
> Am I missing something?
>
> Thanks and regards,
>  - Ashish
>
>
>  On Thu, 20 Jan 2011 21:23:03 -0800
> Ted Dunning <td...@maprtech.com> wrote:
>
> > Were you using LZO?  Repetitive keys should compress almost to
> > nothing.
> >
> > On Thu, Jan 20, 2011 at 8:48 PM, Ashish Shinde <as...@strandls.com>
> > wrote:
> >
> > > Hi Stack,
> > >
> > > Yes makes sense. Will approach it from our needs perspective.
> > >
> > > I tried using a prebaked table and a reasonable partioner with very
> > >  promising results in terms of insert times.
> > >
> > > However the size of a 1.6 GB test file after import resulted in a
> > > hbase folder roughly 6 GB. Although in most cases people are not
> > > disk size sensitive, we would really like to keep disk usage at a
> > > minimum.
> > >
> > > The nature of the data required me to create a rowkey that was 100
> > > bytes long. An examination of the table's datablock's revealed that
> > > every column in the datablock is proceeded by the rowkey, and in our
> > > case this results an overhead of 6 times. Am I doing something
> > > obviously wrong?
> > >
> > > Serializing the row into a single hbase column brought the disk
> > > usage under wraps. Another approach I tried was to club a number of
> > > rows into a single hbase row and used a different indexing scheme
> > > with a simple long rowkey. This provided the best performance and
> > > the used the least amount of disk space.
> > >
> > > Our data is immutable at least as much as I can for see. Is the
> > > serialized row the best option I have? Does the number of rows in a
> > > table affect read performance. If this is the case then clubbing
> > > rows seems the be a reasonable option.
> > >
> > > Thanks and regards,
> > >  - Ashish
> > >
> > >
> > > On Wed, 19 Jan 2011 22:16:33 -0800
> > > Stack <st...@duboce.net> wrote:
> > >
> > > > On Wed, Jan 19, 2011 at 9:50 PM, Ashish Shinde
> > > > <as...@strandls.com> wrote:
> > > > > I have to say I am might impressed with hadoop and hbase, the
> > > > > overall philosophy and the architecture and have decided to
> > > > > contribute as much as time permits. Already looking at the
> > > > > "noob" issues on hbase jira :)
> > > > >
> > > >
> > > > I'd say work on your particular need rather than on noob issues.
> > > > Thats probably the best contrib. you could make.  Figure out the
> > > > blockers -- we'll help out -- that get in the way of your sizeable
> > > > incremental bulk uploads.  Your use case makes for a good story.
> > > >
> > > > Good luck Ashish,
> > > > St.Ack
> > >
> > >
>
>

Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Ashish Shinde <as...@strandls.com>.
Yes, I did try LZO compression and it helped; however, the resultant disk
usage was on par with the uncompressed text size.

Writing out our serialized data records in batches, as a single row with
a single column and with LZO compression enabled, resulted in the data
getting compressed to 25-30% of the original size.

The impressive thing was that with the above approach the number of rows
in our test data dropped from 12 million to 12K.  However, the insert
times were very similar, again indicating that HBase insert times are
largely independent of the current table size.

On a side note, I needed to run a major_compaction to get the data
compressed.  Bulk upload did not write out the data compressed.

Am I missing something?
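
For reference, the major compaction can also be triggered from code,
roughly like this (a sketch assuming the 0.90 HBaseAdmin API; the table
name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CompactAfterLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Asynchronously asks the region servers to major-compact the table;
        // the rewritten HFiles come out with the family's compression applied.
        admin.majorCompact("my_table");  // placeholder table name
      }
    }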

Thanks and regards,
 - Ashish


 On Thu, 20 Jan 2011 21:23:03 -0800
Ted Dunning <td...@maprtech.com> wrote:

> Were you using LZO?  Repetitive keys should compress almost to
> nothing.
> 
> On Thu, Jan 20, 2011 at 8:48 PM, Ashish Shinde <as...@strandls.com>
> wrote:
> 
> > Hi Stack,
> >
> > Yes makes sense. Will approach it from our needs perspective.
> >
> > I tried using a prebaked table and a reasonable partioner with very
> >  promising results in terms of insert times.
> >
> > However the size of a 1.6 GB test file after import resulted in a
> > hbase folder roughly 6 GB. Although in most cases people are not
> > disk size sensitive, we would really like to keep disk usage at a
> > minimum.
> >
> > The nature of the data required me to create a rowkey that was 100
> > bytes long. An examination of the table's datablock's revealed that
> > every column in the datablock is proceeded by the rowkey, and in our
> > case this results an overhead of 6 times. Am I doing something
> > obviously wrong?
> >
> > Serializing the row into a single hbase column brought the disk
> > usage under wraps. Another approach I tried was to club a number of
> > rows into a single hbase row and used a different indexing scheme
> > with a simple long rowkey. This provided the best performance and
> > the used the least amount of disk space.
> >
> > Our data is immutable at least as much as I can for see. Is the
> > serialized row the best option I have? Does the number of rows in a
> > table affect read performance. If this is the case then clubbing
> > rows seems the be a reasonable option.
> >
> > Thanks and regards,
> >  - Ashish
> >
> >
> > On Wed, 19 Jan 2011 22:16:33 -0800
> > Stack <st...@duboce.net> wrote:
> >
> > > On Wed, Jan 19, 2011 at 9:50 PM, Ashish Shinde
> > > <as...@strandls.com> wrote:
> > > > I have to say I am might impressed with hadoop and hbase, the
> > > > overall philosophy and the architecture and have decided to
> > > > contribute as much as time permits. Already looking at the
> > > > "noob" issues on hbase jira :)
> > > >
> > >
> > > I'd say work on your particular need rather than on noob issues.
> > > Thats probably the best contrib. you could make.  Figure out the
> > > blockers -- we'll help out -- that get in the way of your sizeable
> > > incremental bulk uploads.  Your use case makes for a good story.
> > >
> > > Good luck Ashish,
> > > St.Ack
> >
> >


Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Ted Dunning <td...@maprtech.com>.
Were you using LZO?  Repetitive keys should compress almost to nothing.

On Thu, Jan 20, 2011 at 8:48 PM, Ashish Shinde <as...@strandls.com> wrote:

> Hi Stack,
>
> Yes makes sense. Will approach it from our needs perspective.
>
> I tried using a prebaked table and a reasonable partioner with very
>  promising results in terms of insert times.
>
> However the size of a 1.6 GB test file after import resulted in a hbase
> folder roughly 6 GB. Although in most cases people are not disk size
> sensitive, we would really like to keep disk usage at a minimum.
>
> The nature of the data required me to create a rowkey that was 100
> bytes long. An examination of the table's datablock's revealed that
> every column in the datablock is proceeded by the rowkey, and in our
> case this results an overhead of 6 times. Am I doing something obviously
> wrong?
>
> Serializing the row into a single hbase column brought the disk
> usage under wraps. Another approach I tried was to club a number of
> rows into a single hbase row and used a different indexing scheme with a
> simple long rowkey. This provided the best performance and the used the
> least amount of disk space.
>
> Our data is immutable at least as much as I can for see. Is the
> serialized row the best option I have? Does the number of rows in a
> table affect read performance. If this is the case then clubbing rows
> seems the be a reasonable option.
>
> Thanks and regards,
>  - Ashish
>
>
> On Wed, 19 Jan 2011 22:16:33 -0800
> Stack <st...@duboce.net> wrote:
>
> > On Wed, Jan 19, 2011 at 9:50 PM, Ashish Shinde <as...@strandls.com>
> > wrote:
> > > I have to say I am might impressed with hadoop and hbase, the
> > > overall philosophy and the architecture and have decided to
> > > contribute as much as time permits. Already looking at the "noob"
> > > issues on hbase jira :)
> > >
> >
> > I'd say work on your particular need rather than on noob issues.
> > Thats probably the best contrib. you could make.  Figure out the
> > blockers -- we'll help out -- that get in the way of your sizeable
> > incremental bulk uploads.  Your use case makes for a good story.
> >
> > Good luck Ashish,
> > St.Ack
>
>

Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Ashish Shinde <as...@strandls.com>.
Hi Stack,

Yes makes sense. Will approach it from our needs perspective.

I tried using a prebaked table and a reasonable partitioner, with very
promising results in terms of insert times.

However, a 1.6 GB test file resulted, after import, in an HBase folder of
roughly 6 GB.  Although in most cases people are not sensitive to disk
size, we would really like to keep disk usage to a minimum.

The nature of the data required me to create a rowkey that was 100
bytes long.  An examination of the table's data blocks revealed that
every column in a data block is preceded by the rowkey, and in our
case this results in roughly a 6x overhead.  Am I doing something
obviously wrong?

Serializing the row into a single HBase column brought the disk usage
under control.  Another approach I tried was to club a number of rows
into a single HBase row, using a different indexing scheme with a
simple long rowkey.  This provided the best performance and used the
least amount of disk space.
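
To make the clubbing idea concrete, a rough sketch (not the exact code
used here; the family and qualifier names and the length-prefixed
serialization are placeholders, and the write is shown as a plain Put for
simplicity):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedRowWriter {
      private static final byte[] CF = Bytes.toBytes("cf1");      // placeholder family
      private static final byte[] QUAL = Bytes.toBytes("batch");  // the single column

      // Packs a batch of pre-serialized records into one cell keyed by a long batch id.
      public static void writeBatch(HTable table, long batchId, List<byte[]> records)
          throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (byte[] rec : records) {
          out.writeInt(rec.length);  // length prefix so records can be split apart on read
          out.write(rec);
        }
        out.flush();
        Put put = new Put(Bytes.toBytes(batchId));  // simple long rowkey
        put.add(CF, QUAL, bos.toByteArray());
        table.put(put);
      }
    }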

Our data is immutable, at least as far as I can foresee.  Is the
serialized row the best option I have?  Does the number of rows in a
table affect read performance?  If so, then clubbing rows seems to be
a reasonable option.

Thanks and regards,
 - Ashish


On Wed, 19 Jan 2011 22:16:33 -0800
Stack <st...@duboce.net> wrote:

> On Wed, Jan 19, 2011 at 9:50 PM, Ashish Shinde <as...@strandls.com>
> wrote:
> > I have to say I am might impressed with hadoop and hbase, the
> > overall philosophy and the architecture and have decided to
> > contribute as much as time permits. Already looking at the "noob"
> > issues on hbase jira :)
> >
> 
> I'd say work on your particular need rather than on noob issues.
> Thats probably the best contrib. you could make.  Figure out the
> blockers -- we'll help out -- that get in the way of your sizeable
> incremental bulk uploads.  Your use case makes for a good story.
> 
> Good luck Ashish,
> St.Ack


Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Stack <st...@duboce.net>.
On Wed, Jan 19, 2011 at 9:50 PM, Ashish Shinde <as...@strandls.com> wrote:
> I have to say I am might impressed with hadoop and hbase, the overall
> philosophy and the architecture and have decided to contribute as much
> as time permits. Already looking at the "noob" issues on hbase jira :)
>

I'd say work on your particular need rather than on noob issues.
That's probably the best contribution you could make.  Figure out the
blockers -- we'll help out -- that get in the way of your sizeable
incremental bulk uploads.  Your use case makes for a good story.

Good luck Ashish,
St.Ack

Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Ashish Shinde <as...@strandls.com>.
Hi,

Yes, I will try guessing the partitions beforehand and prebaking regions.

I have to say I am mighty impressed with Hadoop and HBase, the overall
philosophy and the architecture, and have decided to contribute as much
as time permits.  I am already looking at the "noob" issues on the HBase
JIRA :)

Thanks and regards, 
 - Ashish

On Wed, 19 Jan 2011 15:03:14 -0800
Stack <st...@duboce.net> wrote:

> On Tue, Jan 18, 2011 at 8:46 PM, Ashish Shinde <as...@strandls.com>
> wrote:
> > Questions
> >
> > 1. Can bulk upload use totalorderpartioner with multiple reducers ?
> >
> 
> Yes.
> 
> Try guessing the partitions for your keys.  Premake a bunch of regions
> in your table.
> 
> 
> > 2. I don't have a handle of the min and max row key from the data
> > unless I collect it over the MAP phase. Is it possible to
> > reconfigure the partioner after map phase is over ?
> >
> 
> It will take the current table regions as input -- hence the
> suggestion above to premake regions.  If the current regions are
> insufficient, then on import of the bulk load, we'll start splitting
> the regions to bring them down under the maximum configured size.
> This can make for a lot of churn.  You might want to start out with
> big regions > 1G, rather than the default 256M.
> 
> You can not redo the partitioning post MR job (you probably weren't
> asking this but in case you were).
> 
> 
> > 3. I would need to frequently load datasets with billions of rows
> > (450-800GB) to hbase as the solution is part of a data processing
> > pipeline. My estimate (optimistic) on a 10 node cluster is 7
> > hours . Is this reasonable. Would hbase scale to say 100s of such
> > datasets, giving I can add disk spsace and nodes to the cluster.
> >
> 
> Bulk load would be the way to go for sure for this kinda of
> incremental bulk loading.
> 
> Can older versions be allowed age out or do you want to keep all
> versions?
> 
> I don't know of anyone currently up in the 100s of TBs of data on an
> HBase cluster but I do know that there are a bunch of us trying to get
> there.  You could join the group and help out where you can scaling it
> up.
> 
> Keep asking questions.
> 
> St.Ack


Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Stack <st...@duboce.net>.
On Tue, Jan 18, 2011 at 8:46 PM, Ashish Shinde <as...@strandls.com> wrote:
> Questions
>
> 1. Can bulk upload use totalorderpartioner with multiple reducers ?
>

Yes.

Try guessing the partitions for your keys.  Premake a bunch of regions
in your table.


> 2. I don't have a handle of the min and max row key from the data
> unless I collect it over the MAP phase. Is it possible to reconfigure
> the partioner after map phase is over ?
>

It will take the current table regions as input -- hence the
suggestion above to premake regions.  If the current regions are
insufficient, then on import of the bulk load, we'll start splitting
the regions to bring them down under the maximum configured size.
This can make for a lot of churn.  You might want to start out with
big regions > 1G, rather than the default 256M.
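
As a sketch, the per-table knob for that looks roughly like this (assuming
the 0.90 HTableDescriptor API; the cluster-wide equivalent is
hbase.hregion.max.filesize in hbase-site.xml):

    // When creating the table, let regions grow to 1 GB before splitting,
    // instead of the 256 MB default.
    HTableDescriptor desc = new HTableDescriptor("my_table");  // placeholder name
    desc.setMaxFileSize(1024L * 1024 * 1024);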

You cannot redo the partitioning after the MR job (you probably weren't
asking this, but just in case you were).


> 3. I would need to frequently load datasets with billions of rows
> (450-800GB) to hbase as the solution is part of a data processing
> pipeline. My estimate (optimistic) on a 10 node cluster is 7 hours . Is
> this reasonable. Would hbase scale to say 100s of such datasets, giving
> I can add disk spsace and nodes to the cluster.
>

Bulk load would be the way to go for sure for this kind of
incremental bulk loading.
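
For completeness, the step that actually moves the generated HFiles into
the table looks roughly like this (a sketch assuming the 0.90
LoadIncrementalHFiles API; the path and table name are placeholders, and
the completebulkload command-line tool does the same thing):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class CompleteBulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Moves the HFiles written by the MR job into the table's regions.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/output/hfiles"),     // placeholder output dir
                          new HTable(conf, "my_table"));  // placeholder table
      }
    }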

Can older versions be allowed to age out, or do you want to keep all
versions?

I don't know of anyone currently up in the 100s of TBs of data on an
HBase cluster, but I do know that there are a bunch of us trying to get
there.  You could join the group and help out where you can with
scaling it up.

Keep asking questions.

St.Ack

Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Stack <st...@duboce.net>.
Sorry Ashish, I didn't grok your answer to Ted.  Are you using
compression?  If so, as Ted says, it should mostly wash away.
St.Ack

On Thu, Jan 20, 2011 at 9:25 PM, Ashish Shinde <as...@strandls.com> wrote:
> Yes I a picked out bits and pieces from ImportTsv.java
>
> Thanks and regards,
>  - Ashish
>
> On Tue, 18 Jan 2011 23:14:47 -0800
> Ted Dunning <td...@maprtech.com> wrote:
>
>> Have you seen the bulk loader?
>>
>> On Tue, Jan 18, 2011 at 8:46 PM, Ashish Shinde <as...@strandls.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I am new to hbase and to hadoop as well so forgive me if the
>> > following is naive.
>> >
>> > I am trying to bulk upload large amounts of data (billions of rows
>> > with 15-20 columns) into an empty hbase table using two column
>> > families.
>> >
>> > The approach I tried was to use MR. The code is copied over and
>> > modified from to ImportTsv.java.
>> >
>> > I did not get good performance because the code used
>> > TotalOrderPartioner which I gathered looked at the current number of
>> > regions and decided to use a single reducer on an empty table.
>> >
>> > I then tried SimpleTotalOrderPartioner with conservatively large
>> > start and end keys which then ended up dividing unequally over our
>> > 10 node cluster.
>> >
>> > Questions
>> >
>> > 1. Can bulk upload use totalorderpartioner with multiple reducers ?
>> >
>> > 2. I don't have a handle of the min and max row key from the data
>> > unless I collect it over the MAP phase. Is it possible to
>> > reconfigure the partioner after map phase is over ?
>> >
>> > 3. I would need to frequently load datasets with billions of rows
>> > (450-800GB) to hbase as the solution is part of a data processing
>> > pipeline. My estimate (optimistic) on a 10 node cluster is 7
>> > hours . Is this reasonable. Would hbase scale to say 100s of such
>> > datasets, giving I can add disk spsace and nodes to the cluster.
>> >
>> > Thanks,
>> >
>> >  - Ashish
>> >
>> >
>
>

Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Ashish Shinde <as...@strandls.com>.
Yes, I picked out bits and pieces from ImportTsv.java.

Thanks and regards,
 - Ashish

On Tue, 18 Jan 2011 23:14:47 -0800
Ted Dunning <td...@maprtech.com> wrote:

> Have you seen the bulk loader?
> 
> On Tue, Jan 18, 2011 at 8:46 PM, Ashish Shinde <as...@strandls.com>
> wrote:
> 
> > Hi,
> >
> > I am new to hbase and to hadoop as well so forgive me if the
> > following is naive.
> >
> > I am trying to bulk upload large amounts of data (billions of rows
> > with 15-20 columns) into an empty hbase table using two column
> > families.
> >
> > The approach I tried was to use MR. The code is copied over and
> > modified from to ImportTsv.java.
> >
> > I did not get good performance because the code used
> > TotalOrderPartioner which I gathered looked at the current number of
> > regions and decided to use a single reducer on an empty table.
> >
> > I then tried SimpleTotalOrderPartioner with conservatively large
> > start and end keys which then ended up dividing unequally over our
> > 10 node cluster.
> >
> > Questions
> >
> > 1. Can bulk upload use totalorderpartioner with multiple reducers ?
> >
> > 2. I don't have a handle of the min and max row key from the data
> > unless I collect it over the MAP phase. Is it possible to
> > reconfigure the partioner after map phase is over ?
> >
> > 3. I would need to frequently load datasets with billions of rows
> > (450-800GB) to hbase as the solution is part of a data processing
> > pipeline. My estimate (optimistic) on a 10 node cluster is 7
> > hours . Is this reasonable. Would hbase scale to say 100s of such
> > datasets, giving I can add disk spsace and nodes to the cluster.
> >
> > Thanks,
> >
> >  - Ashish
> >
> >


Re: Bulk upload with multiple reducers with hbase-0.90.0

Posted by Ted Dunning <td...@maprtech.com>.
Have you seen the bulk loader?

On Tue, Jan 18, 2011 at 8:46 PM, Ashish Shinde <as...@strandls.com> wrote:

> Hi,
>
> I am new to hbase and to hadoop as well so forgive me if the following
> is naive.
>
> I am trying to bulk upload large amounts of data (billions of rows with
> 15-20 columns) into an empty hbase table using two column families.
>
> The approach I tried was to use MR. The code is copied over and
> modified from to ImportTsv.java.
>
> I did not get good performance because the code used
> TotalOrderPartioner which I gathered looked at the current number of
> regions and decided to use a single reducer on an empty table.
>
> I then tried SimpleTotalOrderPartioner with conservatively large start
> and end keys which then ended up dividing unequally over our 10 node
> cluster.
>
> Questions
>
> 1. Can bulk upload use totalorderpartioner with multiple reducers ?
>
> 2. I don't have a handle of the min and max row key from the data
> unless I collect it over the MAP phase. Is it possible to reconfigure
> the partioner after map phase is over ?
>
> 3. I would need to frequently load datasets with billions of rows
> (450-800GB) to hbase as the solution is part of a data processing
> pipeline. My estimate (optimistic) on a 10 node cluster is 7 hours . Is
> this reasonable. Would hbase scale to say 100s of such datasets, giving
> I can add disk spsace and nodes to the cluster.
>
> Thanks,
>
>  - Ashish
>
>