Posted to user@hbase.apache.org by Adam Phelps <am...@opendns.com> on 2011/03/31 03:32:52 UTC
Speeding up LoadIncrementalHFiles?
Does anyone have any suggestions for speeding up LoadIncrementalHFiles?
We have M/R jobs that directly generate HFiles, which are then loaded into
HBase via LoadIncrementalHFiles. We're attempting to maintain a backup
of our production HBase on a backup Hadoop cluster by copying the HFiles
there and then loading them on that cluster.
The problem we're running into is that we want the backup cluster to use
considerably fewer nodes than the primary cluster; however, despite
having a pretty low load (CPU, disk I/O, etc.), it isn't keeping up well.
We'd rather not dedicate more nodes from the overall pool to this
purpose if at all possible. Are there any settings that can be adjusted
to improve the performance of the bulk load?
Alternate suggestions for maintaining an HBase backup would also be of
interest.
- Adam
Re: Speeding up LoadIncrementalHFiles?
Posted by Ted Yu <yu...@gmail.com>.
Adam:
I approached the problem in two steps.
See my patch, 3721-v2.txt, on
https://issues.apache.org/jira/browse/HBASE-3721
Cheers
Re: Speeding up LoadIncrementalHFiles?
Posted by Adam Phelps <am...@opendns.com>.
On 3/31/11 12:41 PM, Ted Yu wrote:
> Adam:
> I logged https://issues.apache.org/jira/browse/HBASE-3721
Thanks for opening that. I haven't delved much into the HBase code
previously, but I may take a look into this since it is causing us some
trouble currently.
- Adam
Re: Speeding up LoadIncrementalHFiles?
Posted by Ted Yu <yu...@gmail.com>.
Adam:
I logged https://issues.apache.org/jira/browse/HBASE-3721
Feel free to comment on that JIRA.
Re: Speeding up LoadIncrementalHFiles?
Posted by Adam Phelps <am...@opendns.com>.
On 3/30/11 8:39 PM, Stack wrote:
> What is slow? The running of the LoadIncrementalHFiles or the copy?
It's the LoadIncrementalHFiles portion.
> If
> the former, is it because the table it's loading into has different
> boundaries than those of the HFiles so the HFiles have to be split?
I'm sure that could be one aspect of it; however, from the logs it looks
like <1% of the HFiles we're loading have to be split. Looking at the
code for LoadIncrementalHFiles (HBase v0.90.1), I'm actually thinking our
problem is that this code loads the HFiles sequentially. Our largest
table has over 2500 regions and the data being loaded is fairly well
distributed across them, so there end up being around 2500 HFiles for
each load period. At 1-2 seconds per HFile, that makes the loading
process very time-consuming.
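As a rough sanity check on those figures, here is a back-of-the-envelope sketch in Java. The class and method names are made up for illustration, and the parallel estimate assumes perfect scaling across workers, which a real cluster will not deliver.

```java
// Back-of-the-envelope estimate only; the inputs are the figures quoted in the message.
final class LoadTimeEstimate {
    /** Wall-clock seconds when HFiles are loaded one at a time. */
    static double sequentialSeconds(int hfiles, double secondsPerFile) {
        return hfiles * secondsPerFile;
    }

    /** Idealized wall-clock seconds with `workers` concurrent loaders (assumes perfect scaling). */
    static double parallelSeconds(int hfiles, double secondsPerFile, int workers) {
        int rounds = (hfiles + workers - 1) / workers; // ceiling division
        return rounds * secondsPerFile;
    }

    public static void main(String[] args) {
        System.out.println(sequentialSeconds(2500, 1.0));  // lower bound at 1 s per HFile
        System.out.println(sequentialSeconds(2500, 2.0));  // upper bound at 2 s per HFile
        System.out.println(parallelSeconds(2500, 2.0, 8)); // hypothetical 8 workers
    }
}
```

With 2500 HFiles at 1-2 seconds each, a sequential load lands between 2500 and 5000 seconds, which brackets the ~3200s seen on the backup cluster.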
On the primary cluster (16 regionservers) one such set of HFiles
loads in ~350s vs. ~3200s on the backup (with 4 regionservers). Overall
the nodes on the backup cluster are running at around 5% CPU (and
similarly minimal disk and network usage). So we have plenty of
resources to throw at the problem; it's just a matter of determining what
we can do here other than adding additional nodes to the cluster.
My first thoughts are to try to add some parallelism, either by
splitting the HFiles into multiple chunks for separate load instances,
or by changing LoadIncrementalHFiles itself to use multiple loading threads.
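The multi-threaded option could look roughly like the sketch below. This is not the LoadIncrementalHFiles code and not the HBASE-3721 patch: `ParallelHFileLoader`, `loadAll`, and the `loadOne` callback are hypothetical names, and the callback stands in for whatever call actually hands a single HFile to its region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper: fans per-HFile load calls out over a fixed thread pool.
final class ParallelHFileLoader {
    /**
     * Runs loadOne for every path using `threads` worker threads.
     * Returns the number of files loaded; throws if any single load failed.
     */
    static int loadAll(List<String> hfilePaths, int threads, Consumer<String> loadOne)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String path : hfilePaths) {
                pending.add(pool.submit(() -> loadOne.accept(path)));
            }
            int loaded = 0;
            for (Future<?> f : pending) {
                f.get(); // rethrows, wrapped in ExecutionException, if that load failed
                loaded++;
            }
            return loaded;
        } finally {
            pool.shutdown(); // stop accepting work; already-queued tasks still run
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            paths.add("/bulk/out/cf/hfile-" + i); // made-up paths for the demo
        }
        // The lambda is a placeholder for the real per-file load call.
        int n = loadAll(paths, 4, p -> { });
        System.out.println("loaded " + n + " files");
    }
}
```

The pool size is the knob to tune here: with the backup regionservers at around 5% CPU there should be headroom for several concurrent loads, though the right number would have to be found by measurement.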
> Is your data only coming in via bulk load?
Yes, everything we put into HBase is via bulk load. We found it to be a
huge improvement over doing individual Puts from the M/R jobs.
- Adam
Re: Speeding up LoadIncrementalHFiles?
Posted by Stack <st...@duboce.net>.
What is slow? The running of the LoadIncrementalHFiles or the copy? If
the former, is it because the table it's loading into has different
boundaries than those of the HFiles so the HFiles have to be split?
Is your data only coming in via bulk load?
St.Ack