Posted to user@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2011/04/03 09:27:29 UTC
Re: Speeding up LoadIncrementalHFiles?

Adam:
I approached the problem in two steps.
See my patch, 3721-v2.txt, on
https://issues.apache.org/jira/browse/HBASE-3721

Cheers

On Thu, Mar 31, 2011 at 12:41 PM, Ted Yu <yu...@gmail.com> wrote:

> Adam:
> I logged https://issues.apache.org/jira/browse/HBASE-3721
>
> Feel free to comment on that JIRA.
>
>
> On Thu, Mar 31, 2011 at 11:14 AM, Adam Phelps <am...@opendns.com> wrote:
>
>> On 3/30/11 8:39 PM, Stack wrote:
>>
>>> What is slow?  The running of LoadIncrementalHFiles or the copy?
>>>
>>
>> It's the LoadIncrementalHFiles portion.
>>
>>
>>> If the former, is it because the table it's loading into has different
>>> boundaries than those of the HFiles so the HFiles have to be split?
>>>
>>
>> I'm sure that could be one aspect of it. However, from the logs it looks
>> like <1% of the HFiles we're loading have to be split.  Looking at the code
>> for LoadIncrementalHFiles (HBase v0.90.1), I'm actually thinking our problem
>> is that this code loads the HFiles sequentially.  Our largest table has over
>> 2500 regions and the data being loaded is fairly well distributed across
>> them, so there end up being around 2500 HFiles for each load period.  At 1-2
>> seconds per HFile, that means the loading process is very time-consuming.
>>
>> On the primary cluster (16 regionservers) one of these sets of HFiles loads
>> in ~350s vs ~3200s on the backup (with 4 regionservers).  Overall the nodes
>> on the backup cluster are running at around 5% CPU (and similarly minimal
>> disk and network usage).  So we have plenty of resources to throw at the
>> problem; it's just a matter of determining what we can do here other than
>> adding additional nodes to the cluster.
>>
>> My first thoughts are to try to add some parallelism, either by splitting
>> the HFiles into multiple chunks for separate load instances, or by changing
>> LoadIncrementalHFiles itself to use multiple loading threads.
>>
>>
>>> Is your data only coming in via bulk load?
>>>
>>
>> Yes, everything we put into HBase is via bulk load.  We found it to be a
>> huge improvement over doing individual Puts from the M/R jobs.
>>
>> - Adam
>>
>
>
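
For illustration only, the "multiple loading threads" idea raised in the quoted message could be sketched along the following lines. This is not the 3721-v2.txt patch and not the actual LoadIncrementalHFiles code: the class name, the loadAll method, and the loadOne callback are all hypothetical, and loadOne merely stands in for whatever per-HFile load step the real tool performs. The thread count is an assumption to be tuned against what the regionservers can absorb.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from HBase.
public class ParallelBulkLoadSketch {

    /**
     * Runs loadOne for every HFile path on a fixed-size thread pool and
     * returns how many files loaded without throwing an exception.
     * loadOne is a placeholder for the real per-file load step.
     */
    public static int loadAll(List<String> hfilePaths, int threads,
                              Consumer<String> loadOne) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        try {
            // Submit one task per HFile instead of loading them one after another.
            for (String path : hfilePaths) {
                pending.add(pool.submit(() -> loadOne.accept(path)));
            }
            int loaded = 0;
            for (Future<?> f : pending) {
                try {
                    f.get();          // wait for this file's task to finish
                    loaded++;
                } catch (ExecutionException e) {
                    // A failed file is reported and skipped; retry policy is left open.
                    System.err.println("load failed: " + e.getCause());
                }
            }
            return loaded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            paths.add("hfile-" + i);  // made-up paths for the demo
        }
        // The sleep simulates a slow per-file load.
        int loaded = loadAll(paths, 4, p -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("loaded " + loaded + " files");
    }
}
```

Since the per-file cost reported in the thread (1-2 seconds across roughly 2500 HFiles) is mostly waiting rather than CPU work, a pool like this would be expected to cut wall-clock time, but how far depends on where the real bottleneck sits, which the thread does not establish.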