Posted to user@hbase.apache.org by Adam Phelps <am...@opendns.com> on 2011/03/31 03:32:52 UTC

Speeding up LoadIncrementalHFiles?

Does anyone have any suggestions for speeding up LoadIncrementalHFiles?

We have M/R jobs that directly generate HFiles, which are then loaded into
HBase via LoadIncrementalHFiles.  We're attempting to maintain a backup
of our production HBase on a backup Hadoop cluster by copying the HFiles
over and loading them there.
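
For context, the load step itself is just the stock bulk-load call.  A
minimal sketch of what we run, where the directory and table name are
made-up examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Directory written by HFileOutputFormat, one subdir per column family
        Path hfileDir = new Path("/backup/hfiles/20110330");
        HTable table = new HTable(conf, "our_table");
        // Moves each HFile into the region covering its key range; any file
        // that straddles a region boundary gets split first.
        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
      }
    }

The same load can be kicked off from the command line:

    $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        /backup/hfiles/20110330 our_table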

The problem we're running into is that we want the backup cluster to use
significantly fewer nodes than the primary cluster; however, despite
having a pretty low load (CPU, disk I/O, etc.) it isn't keeping up well.
We'd rather not dedicate more nodes from the overall pool to this
purpose if at all possible.  Are there any settings that can be adjusted
to improve the performance of the bulk load?

Alternate suggestions for maintaining an HBase backup would also be of 
interest.

- Adam

Re: Speeding up LoadIncrementalHFiles?

Posted by Ted Yu <yu...@gmail.com>.
Adam:
I approached the problem in two steps.
See my patch, 3721-v2.txt, on
https://issues.apache.org/jira/browse/HBASE-3721

Cheers

On Thu, Mar 31, 2011 at 12:41 PM, Ted Yu <yu...@gmail.com> wrote:

> Adam:
> I logged https://issues.apache.org/jira/browse/HBASE-3721
>
> Feel free to comment on that JIRA.
>
>
> On Thu, Mar 31, 2011 at 11:14 AM, Adam Phelps <am...@opendns.com> wrote:
>
>> On 3/30/11 8:39 PM, Stack wrote:
>>
>>> What is slow?  The running of LoadIncrementalHFiles or the copy?
>>>
>>
>> It's the LoadIncrementalHFiles portion.
>>
>>
>>> If
>>> the former, is it because the table it's loading into has different
>>> boundaries than those of the HFiles, so the HFiles have to be split?
>>>
>>
>> I'm sure that could be one aspect of it; however, from the logs it looks
>> like <1% of the HFiles we're loading have to be split.  Looking at the code
>> for LoadIncrementalHFiles (HBase 0.90.1), I'm actually thinking our problem
>> is that this code loads the HFiles sequentially.  Our largest table has over
>> 2500 regions and the data being loaded is fairly well distributed across
>> them, so there end up being around 2500 HFiles for each load period.  At 1-2
>> seconds per HFile, that makes the loading process very time consuming.
>>
>> On the primary cluster (16 regionservers) one such set of HFiles loads
>> in ~350s vs ~3200s on the backup (with 4 regionservers).  Overall the nodes
>> on the backup cluster are running at around 5% CPU (and similarly minimal
>> disk and network usage).  So we have plenty of resources to throw at the
>> problem; it's just a matter of determining what we can do here other than
>> adding additional nodes to the cluster.
>>
>> My first thoughts are to try to add some parallelism, either by splitting
>> the HFiles into multiple chunks for separate load instances, or by changing
>> LoadIncrementalHFiles itself to use multiple loading threads.
>>
>>
>>  Is your data only coming in via bulk load?
>>>
>>
>> Yes, everything we put into HBase is via bulk load.  We found it to be a
>> huge improvement over doing individual Puts from the M/R jobs.
>>
>> - Adam
>>
>
>

Re: Speeding up LoadIncrementalHFiles?

Posted by Adam Phelps <am...@opendns.com>.
On 3/31/11 12:41 PM, Ted Yu wrote:
> Adam:
> I logged https://issues.apache.org/jira/browse/HBASE-3721

Thanks for opening that.  I haven't delved much into the HBase code
previously, but I may take a look at this since it's currently causing
us some trouble.

- Adam

Re: Speeding up LoadIncrementalHFiles?

Posted by Ted Yu <yu...@gmail.com>.
Adam:
I logged https://issues.apache.org/jira/browse/HBASE-3721

Feel free to comment on that JIRA.

On Thu, Mar 31, 2011 at 11:14 AM, Adam Phelps <am...@opendns.com> wrote:

> On 3/30/11 8:39 PM, Stack wrote:
>
>> What is slow?  The running of LoadIncrementalHFiles or the copy?
>>
>
> It's the LoadIncrementalHFiles portion.
>
>
>> If
>> the former, is it because the table it's loading into has different
>> boundaries than those of the HFiles, so the HFiles have to be split?
>>
>
> I'm sure that could be one aspect of it; however, from the logs it looks
> like <1% of the HFiles we're loading have to be split.  Looking at the code
> for LoadIncrementalHFiles (HBase 0.90.1), I'm actually thinking our problem
> is that this code loads the HFiles sequentially.  Our largest table has over
> 2500 regions and the data being loaded is fairly well distributed across
> them, so there end up being around 2500 HFiles for each load period.  At 1-2
> seconds per HFile, that makes the loading process very time consuming.
>
> On the primary cluster (16 regionservers) one such set of HFiles loads
> in ~350s vs ~3200s on the backup (with 4 regionservers).  Overall the nodes
> on the backup cluster are running at around 5% CPU (and similarly minimal
> disk and network usage).  So we have plenty of resources to throw at the
> problem; it's just a matter of determining what we can do here other than
> adding additional nodes to the cluster.
>
> My first thoughts are to try to add some parallelism, either by splitting
> the HFiles into multiple chunks for separate load instances, or by changing
> LoadIncrementalHFiles itself to use multiple loading threads.
>
>
>  Is your data only coming in via bulk load?
>>
>
> Yes, everything we put into HBase is via bulk load.  We found it to be a
> huge improvement over doing individual Puts from the M/R jobs.
>
> - Adam
>

Re: Speeding up LoadIncrementalHFiles?

Posted by Adam Phelps <am...@opendns.com>.
On 3/30/11 8:39 PM, Stack wrote:
> What is slow?  The running of LoadIncrementalHFiles or the copy?

It's the LoadIncrementalHFiles portion.

> If
> the former, is it because the table it's loading into has different
> boundaries than those of the HFiles, so the HFiles have to be split?

I'm sure that could be one aspect of it; however, from the logs it looks
like <1% of the HFiles we're loading have to be split.  Looking at the
code for LoadIncrementalHFiles (HBase 0.90.1), I'm actually thinking our
problem is that this code loads the HFiles sequentially.  Our largest
table has over 2500 regions and the data being loaded is fairly well
distributed across them, so there end up being around 2500 HFiles for
each load period.  At 1-2 seconds per HFile, that makes the loading
process very time consuming.

On the primary cluster (16 regionservers) one such set of HFiles loads
in ~350s vs ~3200s on the backup (with 4 regionservers); 2500 sequential
loads at ~1.3s each accounts for essentially all of that ~3200s.
Overall the nodes on the backup cluster are running at around 5% CPU
(and similarly minimal disk and network usage).  So we have plenty of
resources to throw at the problem; it's just a matter of determining
what we can do here other than adding additional nodes to the cluster.

My first thoughts are to try to add some parallelism, either by
splitting the HFiles into multiple chunks for separate load instances,
or by changing LoadIncrementalHFiles itself to use multiple loading
threads; a rough sketch of the first option is below.
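
For the chunked approach, I'm picturing something like the following
untested sketch.  It assumes the HFileOutputFormat output has already
been divided into several directories with the same per-family layout;
loadInParallel and the chunk directories are made-up names:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class ParallelBulkLoad {
      static void loadInParallel(final String tableName, List<Path> chunkDirs,
          int threads) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> results = new ArrayList<Future<Void>>();
        for (final Path dir : chunkDirs) {
          results.add(pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
              // HTable isn't thread-safe, so each worker gets its own.
              HTable table = new HTable(conf, tableName);
              try {
                new LoadIncrementalHFiles(conf).doBulkLoad(dir, table);
              } finally {
                table.close();
              }
              return null;
            }
          }));
        }
        for (Future<Void> f : results) {
          f.get();  // surface any failure from the workers
        }
        pool.shutdown();
      }
    }

Each directory gets its own LoadIncrementalHFiles instance, so the
chunks load independently and any failure surfaces through the Future.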

> Is your data only coming in via bulk load?

Yes, everything we put into HBase is via bulk load.  We found it to be a
huge improvement over doing individual Puts from the M/R jobs.
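
For reference, the generating side is just the standard
HFileOutputFormat setup.  A rough sketch of a driver, where the class
names, paths, the "d" column family, and the tab-separated input format
are all made-up examples:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HFileGenJob {
      static class HFileMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Hypothetical input records: rowkey<TAB>value
          String[] parts = line.toString().split("\t", 2);
          byte[] row = Bytes.toBytes(parts[0]);
          ctx.write(new ImmutableBytesWritable(row),
              new KeyValue(row, Bytes.toBytes("d"), Bytes.toBytes("v"),
                  Bytes.toBytes(parts[1])));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "hfile-gen");
        job.setJarByClass(HFileGenJob.class);
        job.setMapperClass(HFileMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path("/input/20110330"));
        FileOutputFormat.setOutputPath(job, new Path("/backup/hfiles/20110330"));
        // Wires in TotalOrderPartitioner and a sorting reducer so the output
        // HFiles line up with the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job,
            new HTable(job.getConfiguration(), "our_table"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The useful part is that configureIncrementalLoad reads the table's
current region start keys at job-submission time and partitions the map
output accordingly, which is why so few of the resulting HFiles need
splitting at load time.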

- Adam

Re: Speeding up LoadIncrementalHFiles?

Posted by Stack <st...@duboce.net>.
What is slow?  The running of LoadIncrementalHFiles or the copy?  If
the former, is it because the table it's loading into has different
boundaries than those of the HFiles, so the HFiles have to be split?

Is your data only coming in via bulk load?

St.Ack



On Wed, Mar 30, 2011 at 6:32 PM, Adam Phelps <am...@opendns.com> wrote:
> Does anyone have any suggestions for speeding up LoadIncrementalHFiles?
>
> We have M/R jobs that directly generate HFiles, which are then loaded into
> HBase via LoadIncrementalHFiles.  We're attempting to maintain a backup of
> our production HBase on a backup Hadoop cluster by copying the HFiles over
> and loading them there.
>
> The problem we're running into is that we want the backup cluster to use
> significantly fewer nodes than the primary cluster; however, despite having
> a pretty low load (CPU, disk I/O, etc.) it isn't keeping up well.  We'd
> rather not dedicate more nodes from the overall pool to this purpose if at
> all possible.  Are there any settings that can be adjusted to improve the
> performance of the bulk load?
>
> Alternate suggestions for maintaining an HBase backup would also be of
> interest.
>
> - Adam
>