Posted to user@hbase.apache.org by Cai Sijie <ca...@fujixerox.co.jp> on 2008/10/28 08:11:22 UTC

hbase performance period

Hi all,

Our environment has 4 machines (1 is also a slave).
They are all region servers.

First I wrote 1000 data items to HBase. Then I read the data back in the 
order it was written and measured the read performance.
I drew a graph (x-axis is the data item number, y-axis is the time each 
read takes). I found that the graph was periodic, with a period of 128.
Every test run looks like this.  E.g. the 128th item, the 256th item, ... 
cost the most time, and the time increases from the 1st to the 128th item 
and again from the 129th to the 256th.
I think this is strange, and 128 is a very particular number. I have tried 
many ways to find out why, but without result.
Can anyone tell me the reason?

Thanks and best regards. 

Re: hbase performance period

Posted by Michael Stack <st...@duboce.net>.
CaiSijie wrote:
> Thank you for your reply.
> My version of HBase is 0.18.0.  Yes, I read the data in series.
>
> But what I see is that reading the 1st item costs the least time and reading
> the 128th item costs the most. That is, the time increases from the 1st to
> the 128th item. Then, when reading the 129th item, the time drops back to
> about the same as for the 1st item. So the period is 128: the 128th, 256th,
> 384th, ... items need the most time, and the 1st, 129th, 257th, 385th, ...
> items need the least.
>
> I have tested many times and this phenomenon always appears. I am sure that
> the mapfile index interval is 32.
> I should also mention that every data item is 100 kilobytes.
>
> I am confused...
Thanks for spending more time on this.

So, what I think is happening is that when you do a get on a key that is 
in the MapFile index, the seek goes directly to the correct offset.  
Otherwise, we seek to an index key that sorts before the asked-for key 
and then we call the SequenceFile.next until we hit the requested key 
(See around line 432 in this file: http://tinyurl.com/63ejru.  The 
core of the hbase get is the getClosest method in Hadoop MapFile which 
calls this internalSeek method).

If the asked-for key is in the index, the get is at its fastest; it gets 
steadily slower as we search forward by next'ing through the data file 
until we hit the next index entry.
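
The lookup pattern described above can be sketched as a small simulation. This is a minimal illustration only, not Hadoop's MapFile code: `SparseIndexFile` and `steps_to_find` are hypothetical names, keys are plain integers, and the cost is counted as the number of forward steps rather than measured time.

```python
from bisect import bisect_right


class SparseIndexFile:
    """Toy model of a sorted data file with a sparse index (hypothetical, not MapFile)."""

    def __init__(self, keys, index_interval):
        self.keys = sorted(keys)
        # Index every index_interval-th entry: positions 0, interval, 2*interval, ...
        self.index_positions = list(range(0, len(self.keys), index_interval))
        self.index_keys = [self.keys[p] for p in self.index_positions]

    def steps_to_find(self, key):
        """Return how many forward steps are needed after seeking via the index."""
        # Find the last index key that sorts at or before the requested key.
        slot = bisect_right(self.index_keys, key) - 1
        if slot < 0:
            raise KeyError(key)
        pos = self.index_positions[slot]
        steps = 0
        # Scan forward entry by entry, as a sequential reader would.
        while pos < len(self.keys) and self.keys[pos] < key:
            pos += 1
            steps += 1
        if pos == len(self.keys) or self.keys[pos] != key:
            raise KeyError(key)
        return steps


f = SparseIndexFile(range(1000), index_interval=128)
costs = [f.steps_to_find(k) for k in range(1000)]
print(costs[0], costs[127], costs[128], costs[255], costs[256])
```

With 0-based positions the cheapest lookups land on 0, 128, 256, ... and the most expensive on 127, 255, ..., which corresponds to the reported sawtooth when items are numbered from 1.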

Because your values are 100k, the progression is noticeable.
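
A rough back-of-the-envelope estimate shows why. The sketch below assumes each value is exactly 100 KB (taken here as 100 * 1024 bytes) and that the reader has to pass over every value between the index entry and the requested key. `bytes_skipped` is a hypothetical helper, and real costs also depend on block sizes, caching, and the network.

```python
VALUE_SIZE = 100 * 1024  # bytes; assumes every value is exactly 100 KB


def bytes_skipped(offset_in_interval, value_size=VALUE_SIZE):
    """Bytes read and discarded before reaching the requested entry."""
    return offset_in_interval * value_size


for interval in (32, 128):
    # The worst case is the last entry before the next index entry.
    worst = bytes_skipped(interval - 1)
    print(f"interval {interval}: worst case skips {worst / (1024 * 1024):.1f} MiB")
```

Under these assumptions the worst-case read skips roughly 12.4 MiB with an interval of 128, against roughly 3.0 MiB with an interval of 32.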

I didn't understand why the interval was 128 rather than 32, but I just 
added logging and see that our attempt at setting it to 32, at least in 
this mapreduce context that I tested in, is broken.  I opened an issue, 
HBASE-981 (Thanks for finding this one).

St.Ack





Re: hbase performance period

Posted by CaiSijie <ca...@fujixerox.co.jp>.
Thank you for your reply.
My version of HBase is 0.18.0.  Yes, I read the data in series.

But what I see is that reading the 1st item costs the least time and reading
the 128th item costs the most. That is, the time increases from the 1st to
the 128th item. Then, when reading the 129th item, the time drops back to
about the same as for the 1st item. So the period is 128: the 128th, 256th,
384th, ... items need the most time, and the 1st, 129th, 257th, 385th, ...
items need the least.

I have tested many times and this phenomenon always appears. I am sure that
the mapfile index interval is 32.
I should also mention that every data item is 100 kilobytes.

I am confused...

Sijie Cai 







Re: hbase performance period

Posted by stack <st...@duboce.net>.
The only thing that comes to mind is that by default in hadoop, the
mapfile index interval is 128; every 128th entry in mapfile gets an
entry in the mapfile index. Only, in hbase, we change the default
interval to be 32. Check to make sure you are picking up
hbase.io.index.interval of 32.

Otherwise, I'm not sure as to why you would see the below. Are you
saying that there is a step every 128 intervals? That the 129th read
takes longer than the read at position 1 and that the read at position
257 takes longer than the read at position 129?

The fact that it takes increasingly longer as you read from position 0
up to 128 makes sense -- if the index interval is every 128 -- because
we do serial search forward from the closest index position.
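
One way to see which index interval is actually in effect is to estimate the period directly from the measured read times. The sketch below is an illustration under stated assumptions, not an HBase tool: `estimate_period` is a hypothetical helper that treats any read much faster than the one before it as the start of a new index interval, and reports the most common gap between such drops. The sample timings are synthetic, not measurements.

```python
from collections import Counter


def estimate_period(times, drop_ratio=0.5):
    """Guess the sawtooth period from per-read latencies.

    A "drop" is a read that takes less than drop_ratio times the previous read.
    Returns the most common distance between consecutive drops, or None.
    """
    drops = [i for i in range(1, len(times)) if times[i] < drop_ratio * times[i - 1]]
    gaps = [b - a for a, b in zip(drops, drops[1:])]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]


# Synthetic latencies: a base cost plus a cost that grows with the
# distance from the previous index entry (period 128 here).
synthetic = [1.0 + 0.5 * (i % 128) for i in range(1000)]
print(estimate_period(synthetic))
```

Real timings are noisy, so `drop_ratio` would likely need tuning; a result of 128 rather than 32 would suggest the configured interval is not being picked up.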

What version of hbase are you using?

You are doing your reads in series?

This is really interesting stuff. Can you dig in some more and try to
figure out what's going on?

Thanks Cai.

St.Ack

