Posted to user@hbase.apache.org by Jean-Daniel Cryans <jd...@apache.org> on 2011/09/07 20:12:56 UTC

Calculating the optimal number of regions (WAS -> Re: big compaction queue size)

(Branching this discussion since it's not directly relevant to the other thread)

I think if we ever come up with a formula, it needs to come with a big
"your mileage may vary" sign. The reasons being:

 - If only a subset of the regions are getting written to, then only
those regions need to be accounted for (I think this is what you
referred to by Active Regions)
 - If the load is read heavy then you'd want to flush as little as
possible, meaning very few regions (possibly forcing them to be fewer
than the theoretical maximum)
 - Not all tables may have the same flush size.
 - Some regions might be more active than others and may flush a lot
more, and since we keep both active and inactive data in the HLogs
then you might be churning more than you need to.
 - Same for families.

Now on the formula:

> If( (Hlognumber*hdfsblock) > (HBASE_HEAPSIZE *memstore.lowerLimit) )

That's ok.

>   Active Regions  = (HBASE_HEAPSIZE *memstore.lowerLimit )/( flush.size / (2~3))

Could you explain the division by 2 or 3? I'm not sure I'm following
that. Also, I don't remember whether applying the flush size per region
was fixed (it should be per family), but this would have an effect too.

> Else
>   Active Regions  =  (Hlognumber*hdfsblock)/ (flush.size / (2~3))

Same comments.

J-D

2011/9/6 Gaojinchao <ga...@huawei.com>:
> Hi J-D
> Should we give a formula for active regions per node and add it to the book? I think many people encounter the same problem.
>
> I think the formula is:
> If( (Hlognumber*hdfsblock) > (HBASE_HEAPSIZE *memstore.lowerLimit) )
>   Active Regions  = (HBASE_HEAPSIZE *memstore.lowerLimit )/( flush.size / (2~3))
> Else
>   Active Regions  =  (Hlognumber*hdfsblock)/ (flush.size / (2~3))
>
>
> If I am wrong, please correct me. Thanks.
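
The If/Else in the proposed formula amounts to taking the smaller of two budgets (total HLog space versus the global memstore lower limit) and dividing it by the per-region memstore size. A minimal Python sketch of that reading follows. The function name, parameter names, and sample values are illustrative assumptions, not HBase APIs or recommended settings, and the divisor is left as a parameter because the thread does not settle whether dividing by 2 or 3 is justified.

```python
MB = 1024 ** 2


def active_regions(hlog_number, hdfs_block_bytes, heap_bytes,
                   memstore_lower_limit, flush_size_bytes, divisor=1):
    """Rough estimate of the active regions one region server can sustain.

    Follows the formula quoted in the thread; all names are hypothetical.
    """
    hlog_capacity = hlog_number * hdfs_block_bytes
    memstore_capacity = heap_bytes * memstore_lower_limit
    # The If/Else in the formula selects whichever budget is smaller.
    budget = min(hlog_capacity, memstore_capacity)
    # budget / (flush_size / divisor), rearranged to avoid rounding error.
    return int(budget * divisor // flush_size_bytes)


# Illustrative values only: 32 HLogs, 64 MB blocks, 8 GB heap,
# lowerLimit 0.35, flush size 128 MB.
print(active_regions(32, 64 * MB, 8 * 1024 * MB, 0.35, 128 * MB))
print(active_regions(32, 64 * MB, 8 * 1024 * MB, 0.35, 128 * MB, divisor=3))
```

With these sample values the HLog budget (2 GB) is the smaller one, so it drives the result. As noted in the thread, any such number comes with a "your mileage may vary" caveat.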

Re: Calculating the optimal number of regions (WAS -> Re: big compaction queue size)

Posted by Jean-Daniel Cryans <jd...@apache.org>.
But the pre-compressed size is still the one that's using heap, right?
Same for space in the HLogs, so you shouldn't reduce the weight of the
flush size in the formula.

J-D

On Thu, Sep 8, 2011 at 2:11 AM, Gaojinchao <ga...@huawei.com> wrote:
> J-D:
> Thanks a lot. You are right.
> I may not have taken some factors into account. My case is write heavy, so I don't want to flush small files.
>
> "2 or 3" is an empirical value: it expresses the smallest size a memstore should have when it is flushed (flush.size divided by 2 or 3).
> E.g. if flush.size = 128 MB, the HFile size is 128 MB / 3 / compression ratio, probably a bit more than ten megabytes, which is very small compared to the region size (1 GB or more).
> In my case, I want to reduce the pressure on compaction (which has only one thread).

Re: Calculating the optimal number of regions (WAS -> Re: big compaction queue size)

Posted by Gaojinchao <ga...@huawei.com>.
J-D:
Thanks a lot. You are right.
I may not have taken some factors into account. My case is write heavy, so I don't want to flush small files.

"2 or 3" is an empirical value: it expresses the smallest size a memstore should have when it is flushed (flush.size divided by 2 or 3).
E.g. if flush.size = 128 MB, the HFile size is 128 MB / 3 / compression ratio, probably a bit more than ten megabytes, which is very small compared to the region size (1 GB or more).
In my case, I want to reduce the pressure on compaction (which has only one thread).
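
The arithmetic in this example can be sketched as follows, separating the uncompressed size (which is what occupies heap and HLog space, per J-D's reply) from the compressed HFile size on disk. The compression ratio of 4 is an assumed value chosen only for illustration, since the thread does not state one, and the function name is hypothetical.

```python
def flush_footprint_mb(flush_size_mb, divisor, compression_ratio):
    """Return (uncompressed MB at flush time, compressed HFile MB on disk)."""
    # Uncompressed size: this is what counts against heap and HLog space.
    in_heap_mb = flush_size_mb / divisor
    # Compressed size: this is what lands on disk as an HFile.
    hfile_mb = in_heap_mb / compression_ratio
    return in_heap_mb, hfile_mb


# flush.size = 128 MB, divisor = 3, assumed compression ratio = 4
in_heap, on_disk = flush_footprint_mb(128, 3, 4)
print(f"{in_heap:.1f} MB in heap, {on_disk:.1f} MB on disk")
# prints "42.7 MB in heap, 10.7 MB on disk"
```

Under that assumed ratio the flushed file is roughly ten megabytes, matching the estimate in the message, while the memory and HLog cost is still the uncompressed 42.7 MB.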


