
Posted to user@hbase.apache.org by Tao Xie <xi...@gmail.com> on 2011/01/18 03:20:33 UTC

impact of total region numbers?

For example, given a fixed amount of data, I can tune
hbase.hregion.max.filesize to increase or decrease the total number of regions, right?
I want to know whether the number of regions has a performance impact on random read
tests. In my YCSB test, I observed that with a larger HFile size I got
better throughput and lower latency.
Can anybody give me some hints? Thanks.

Tao
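
As a rough illustration of the relationship asked about above: for a fixed amount of data, the region count scales roughly inversely with hbase.hregion.max.filesize. The sketch below is only back-of-the-envelope arithmetic. The function name, the 100 GB table size, and the assumption of one store per region filled exactly to the limit are all hypothetical; a real cluster will have more regions, because a split leaves two partly filled daughter regions.

```python
import math


def estimate_region_count(total_data_mb: int, max_filesize_mb: int) -> int:
    """Lower-bound estimate of the regions needed to hold total_data_mb.

    Assumes (hypothetically) one store per region and regions filled
    exactly to hbase.hregion.max.filesize. Real counts are higher because
    a region splits into two daughters that are each only partly full.
    """
    if total_data_mb < 0 or max_filesize_mb <= 0:
        raise ValueError("total must be >= 0 and max file size must be > 0")
    return max(1, math.ceil(total_data_mb / max_filesize_mb))


if __name__ == "__main__":
    # Hypothetical 100 GB table; 256 and 1024 are the max file sizes
    # tried later in this thread (assumed here to be in MB).
    total_mb = 100 * 1024
    for max_mb in (256, 1024):
        print(max_mb, estimate_region_count(total_mb, max_mb))
```

Quadrupling the maximum file size cuts the estimated region count to a quarter, which is the knob the question is about.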

Re: impact of total region numbers?

Posted by Stack <st...@duboce.net>.

Along with Tatsuya, I thank you for sharing this interesting result.

I too wonder why the bigger block makes a difference -- 25%
improvement is a bunch -- since we set up a socket on each random read
and seek the block (we do not currently reuse connection if correct
block is already in the breach)?

Thanks for trying this experiment.
St.Ack


Re: impact of total region numbers?

Posted by Tatsuya Kawano <ta...@gmail.com>.

Hi Tao, 

Thanks for sharing the test result. 

> but I really did run a major compaction from the command line:
> major_compact 'mytable'
> and checked that each region has only one storefile.

Yes, that's what I meant. So that isn't the cause in your case.


> My understanding of the results is that with fewer HDFS blocks per HFile,
> the lookup of a random row is faster because it avoids jumping from one
> block to another (Test 1 vs. Test 2)

I can't tell whether this is correct because of my limited knowledge of HDFS. But I think having fewer HDFS blocks could let the hard drives seek to the data more quickly, because HDFS tries to store all the bytes of a block in a contiguous location on disk. Fewer blocks (fewer fragments) on the hard drives should improve seek latency, especially when multiple threads are trying to access the same drives.

Thanks, 

--
Tatsuya Kawano (Mr.)
Tokyo, Japan



Re: impact of total region numbers?

Posted by Tao Xie <xi...@gmail.com>.

Thanks for the response.
I tuned the values of dfs.block.size and hbase.hregion.max.filesize for my
tests (pure read tests) and got the results below:
Test  dfs.block.size  hbase.hregion.max.filesize  requests/sec  latency
1     32              1024                        ~4000         24
2     256             256                         ~4500         22
3     1024            1024                        ~5000         20

My understanding of the results is that with fewer HDFS blocks per HFile,
the lookup of a random row is faster because it avoids jumping from one
block to another (Test 1 vs. Test 2); and with fewer but bigger regions,
performance is also better? (Test 2 vs. Test 3).
Sure, I believe the number of HFiles per region has an impact, but I
really did run a major compaction from the command line:
major_compact 'mytable'
and checked that each region has only one storefile.

Is that correct?
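
To make the comparison above concrete, here is a small sketch that works out the relative change against Test 1 and how many HDFS blocks a store file of maximum size would span. It only assumes that both settings are expressed in the same unit (the units are not stated above); the function and variable names are made up for illustration.

```python
import math

# (test, dfs.block.size, hbase.hregion.max.filesize, requests/sec, latency)
# Values copied from the table above; both sizes are assumed to share a unit.
RESULTS = [
    (1, 32, 1024, 4000, 24),
    (2, 256, 256, 4500, 22),
    (3, 1024, 1024, 5000, 20),
]


def blocks_per_full_file(block_size: int, max_filesize: int) -> int:
    """HDFS blocks spanned by a store file that has grown to max_filesize."""
    return math.ceil(max_filesize / block_size)


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    base_rps, base_latency = RESULTS[0][3], RESULTS[0][4]
    for test, block, max_size, rps, latency in RESULTS:
        print(
            f"test {test}: up to {blocks_per_full_file(block, max_size)} "
            f"block(s) per file, "
            f"throughput {percent_change(base_rps, rps):+.1f}%, "
            f"latency {percent_change(base_latency, latency):+.1f}% vs. test 1"
        )
```

Under that assumption, a full store file fits in a single HDFS block in Tests 2 and 3, while in Test 1 it can span up to 32 blocks.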




Re: impact of total region numbers?

Posted by Tatsuya Kawano <ta...@gmail.com>.

Hi Tao, 

I don't think the number of regions has much impact on random read throughput and latency. But the number of generations (HFiles) per region does.

If this is the case, try running a major compaction on the table. This merges the HFile generations, so read throughput and latency should recover. You can do this from the HBase shell.

Also, you might want to increase hbase.hregion.memstore.flush.size to keep the number of HFile generations smaller.

Thanks, 

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
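
A rough way to see why a larger memstore flush size means fewer HFile generations: each flush writes one new HFile, so for the same volume of writes a bigger flush threshold leaves fewer files to read from until the next compaction. The sketch below is a simplification that assumes a flush happens exactly when the memstore reaches the threshold and that no compaction runs in between; the function name and the sizes are hypothetical.

```python
import math


def hfiles_before_compaction(written_mb: float, flush_size_mb: float) -> int:
    """HFiles produced by flushing written_mb of writes to a single store.

    Simplified model: one HFile per flush, a flush exactly at the
    threshold, no compaction in between. The final partial memstore is
    counted as one more file, as if it were flushed at the end.
    """
    if written_mb < 0 or flush_size_mb <= 0:
        raise ValueError("written_mb must be >= 0 and flush_size_mb > 0")
    return math.ceil(written_mb / flush_size_mb)


if __name__ == "__main__":
    # Hypothetical: 1 GB written to one store at three flush thresholds (MB).
    for flush_mb in (64, 128, 256):
        print(flush_mb, hfiles_before_compaction(1024, flush_mb))
```

In this model, doubling the flush size halves the number of files a random read may have to consult before a compaction merges them.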
