Posted to user@hbase.apache.org by Oleg Ruchovets <or...@gmail.com> on 2010/11/11 12:15:26 UTC

scan performance improvement

Hi,
   To improve client performance I changed
hbase.client.scanner.caching from 1 to 50.
After running the client with the new value (hbase.client.scanner.caching = 50)
it didn't improve execution time at all.

I have ~9 million small records.
I have to do a full scan, so it brings all 9 million records to the client.
My assumption was that this change would bring a significant improvement, but it
did not.

Additional information:
I scan a table which has 100 regions
5 servers
20 maps
4 concurrent maps
The scan process takes 5.5 - 6 hours. To me that seems like too much time. Am I right?
And how can I improve it?


I changed the value in all hbase-site.xml files and restarted HBase.

Any suggestions?

Re: scan performance improvement

Posted by Friso van Vollenhoven <fv...@xebia.com>.

256M = default MAX_FILESIZE (HBase region size at which a region is split)
64K = default HBase block size
64M = HDFS default block size

If you look at a table definition in the HBase master UI you can see settings for your table. Like this:
{NAME => 'inrdb_rir_stats', MAX_FILESIZE => '268435456', FAMILIES => [{NAME => 'data', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Also, have a look here to see how HBase stores data: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
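
To keep the three sizes apart, here is a minimal sketch that puts the default values quoted above side by side. The dictionary keys are made-up labels for illustration, not HBase or HDFS property names.

```python
KIB = 1024
MIB = 1024 * KIB

# Default values as quoted in this message; the keys are illustrative labels only.
SIZES = {
    "hbase_max_filesize": 256 * MIB,  # MAX_FILESIZE => '268435456'
    "hbase_block": 64 * KIB,          # BLOCKSIZE => '65536'
    "hdfs_block": 64 * MIB,           # HDFS default block size
}

# How many HBase blocks fit in one HDFS block, and how many HDFS blocks
# a store file spans when it reaches the MAX_FILESIZE limit.
hbase_blocks_per_hdfs_block = SIZES["hdfs_block"] // SIZES["hbase_block"]
hdfs_blocks_per_max_file = SIZES["hbase_max_filesize"] // SIZES["hdfs_block"]

print(hbase_blocks_per_hdfs_block, hdfs_blocks_per_max_file)
```

The HBase block is the unit that matters for scanner caching; the other two sizes govern region splitting and HDFS storage layout.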




On 11 nov 2010, at 14:11, Michael Segel wrote:

> 
> Correct me if I'm wrong, but isn't hbase's default block size 256MB while hadoop's default blocksize is 64MB?
> 
> 

RE: scan performance improvement

Posted by Michael Segel <mi...@hotmail.com>.

Correct me if I'm wrong, but isn't hbase's default block size 256MB while hadoop's default blocksize is 64MB?


> From: fvanvollenhoven@xebia.com
> To: user@hbase.apache.org
> Subject: Re: scan performance improvement
> Date: Thu, 11 Nov 2010 13:08:56 +0000
> 
> Not that block size (that's the HDFS one), but the HBase block size. You set it at table creation or it uses the default of 64K.
> 
> The description of hbase.client.scanner.caching says:
> Number of rows that will be fetched when calling next
> on a scanner if it is not served from memory. Higher caching values
> will enable faster scanners but will eat up more memory and some
> calls of next may take longer and longer times when the cache is empty.
> 
> That means that it will pre-fetch that number of rows, if the next row does not come from memory. So if your rows are small enough to fit 100 of them in one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it will only go to disk when it exhausts the whole block, which sticks in block cache. So, it will still fetch the same amount of data from disk every time. If you increase the number to a value that is certain to load multiple blocks at a time from disk, it will increase performance.
> 
> 
> 

Re: scan performance improvement

Posted by Oleg Ruchovets <or...@gmail.com>.

Hi,

I didn't change the block size (it is still 64k).
I am running a test configured with a caching size of 3600.
The test is still running, but I can already see that there is NO performance
improvement.
    How can I check that HBase is actually using the changed caching size?
Can I see it in the logs or with some debugging?

Thanks
Oleg.
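
One partial check, sketched here under assumptions, is to read the value straight out of the hbase-site.xml that the client actually sees. The sample XML and the helper name `read_property` are hypothetical; the property name and the default of 1 come from this thread. This only shows what the file says. A scan that sets its own caching value in code (for example through `Scan.setCaching`) would not be reflected by it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same <property> layout as the config quoted in this thread.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>3600</value>
  </property>
</configuration>"""


def read_property(xml_text, name, default=None):
    """Return the <value> of the named <property>, or `default` if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return (prop.findtext("value") or "").strip()
    return default


# "1" is the default caching value mentioned in the original post.
print(read_property(SAMPLE_SITE_XML, "hbase.client.scanner.caching", default="1"))
```

To check a real file, pass its contents instead of the sample string, and run the check on the machines where the map tasks run, since that is where the client configuration is read.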

On Thu, Nov 11, 2010 at 8:03 PM, Ryan Rawson <ry...@gmail.com> wrote:

> I'd be careful about adjusting HFile block size, we took 64k after
> benchmarking a bunch of things, and it seemed to be a good performance
> point.
>
> As for scanning small rows, I'd go with a caching size of 1000-3000.
> When I set my scanners to that, I can pull 50k+ rows/sec from 1
> client.
>

Re: scan performance improvement

Posted by Ryan Rawson <ry...@gmail.com>.

I'd be careful about adjusting HFile block size, we took 64k after
benchmarking a bunch of things, and it seemed to be a good performance
point.

As for scanning small rows, I'd go with a caching size of 1000-3000.
When I set my scanners to that, I can pull 50k+ rows/sec from 1
client.
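
As a rough back-of-the-envelope sketch, this relates the caching size to the number of scanner round trips for the 9 million rows from the original post, and shows what 50k rows/sec would mean for a full scan. The function names are made up for illustration. It ignores the extra calls at the 100 region boundaries and assumes every batch comes back full.

```python
TOTAL_ROWS = 9_000_000  # table size from the original post


def next_round_trips(total_rows, caching):
    """Approximate number of scanner fetches needed (integer ceiling division)."""
    return (total_rows + caching - 1) // caching


def scan_seconds(total_rows, rows_per_sec):
    """Time for a full scan at a steady row rate."""
    return total_rows / rows_per_sec


for caching in (1, 50, 1000, 3000):
    print(caching, next_round_trips(TOTAL_ROWS, caching))

# A single client at 50k rows/sec would need this many seconds for the whole table.
print(scan_seconds(TOTAL_ROWS, 50_000))
```

Under these assumptions, a 5.5 to 6 hour scan is far from the rate quoted above, which suggests looking beyond the caching setting alone.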

On Thu, Nov 11, 2010 at 7:36 AM, Friso van Vollenhoven
<fv...@xebia.com> wrote:
>> Great , thank you for the explanation.
>>
>>  my table schema is:
>>
>>         {NAME => 'URLs_sanity', FAMILIES => [{NAME => 'gs', VERSIONS =>
>> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
>> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta-data', VERSIONS
>> => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
>> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'snt', VERSIONS =>
>> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
>> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]
>>
>> couple of questions:
>>     1) How can I know what is the optimal size of BlockSize? What is the
>> best practice regarding this issue
>
> Check the link I sent. There is an explanation on this setting in there.
>
>>     2) Assuming that I have a record 4 k and changed to 50 --> 4*50 = 200
>> and it is ~ 3 blocks , so performance had to be improved , but execution
>> time was the same.
>
> There is of course more involved than just this. And also, you may be already getting the most of what your hardware can give you. You should also try to find out what bottleneck you have (IO or CPU or network). Hadoop and HBase have many settings. There is no magic single knob that makes things fast or slow.
>
>>
>> Oleg.
>>
>>

Re: scan performance improvement

Posted by Friso van Vollenhoven <fv...@xebia.com>.

> Great , thank you for the explanation.
> 
>  my table schema is:
> 
>         {NAME => 'URLs_sanity', FAMILIES => [{NAME => 'gs', VERSIONS =>
> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta-data', VERSIONS
> => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'snt', VERSIONS =>
> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]
> 
> couple of questions:
>     1) How can I know what is the optimal size of BlockSize? What is the
> best practice regarding this issue

Check the link I sent. There is an explanation on this setting in there.

>     2) Assuming that I have a record 4 k and changed to 50 --> 4*50 = 200
> and it is ~ 3 blocks , so performance had to be improved , but execution
> time was the same.

There is of course more involved than just this. And also, you may already be getting the most out of what your hardware can give you. You should also try to find out what bottleneck you have (IO or CPU or network). Hadoop and HBase have many settings. There is no magic single knob that makes things fast or slow.
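
A first step in finding the bottleneck can be to work out the throughput the job actually achieves and compare it with what the disks and network should deliver. This is a minimal sketch using the figures from this thread (9 million rows, ~4k per record, 5.5 hours). The function name is made up, and the byte figure ignores key/value overhead.

```python
ROWS = 9_000_000      # from the original post
ROW_BYTES = 4 * 1024  # ~4k per record, as stated in the thread


def throughput(rows, row_bytes, hours):
    """Return (rows per second, bytes per second) for the whole job."""
    seconds = hours * 3600
    return rows / seconds, rows * row_bytes / seconds


rows_s, bytes_s = throughput(ROWS, ROW_BYTES, 5.5)
print(f"{rows_s:.0f} rows/s, {bytes_s / 2**20:.2f} MiB/s")
```

If the resulting rate is well below what a single disk or network link can sustain, the time is probably going somewhere other than raw IO, for example into per-row work in the map tasks.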

> 
> Oleg.
> 
> 

Re: scan performance improvement

Posted by Oleg Ruchovets <or...@gmail.com>.

Great, thank you for the explanation.

  My table schema is:

         {NAME => 'URLs_sanity', FAMILIES => [{NAME => 'gs', VERSIONS =>
'1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta-data', VERSIONS
=> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'snt', VERSIONS =>
'1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]

A couple of questions:
     1) How can I know the optimal value for BLOCKSIZE? What is the
best practice regarding this?
     2) Assuming a record is ~4k and caching is changed to 50: 4k * 50 = 200k,
which is ~3 blocks of 64k, so performance should have improved, but the execution
time was the same.

Oleg.


On Thu, Nov 11, 2010 at 3:08 PM, Friso van Vollenhoven <
fvanvollenhoven@xebia.com> wrote:

> Not that block size (that's the HDFS one), but the HBase block size. You
> set it at table creation or it uses the default of 64K.
>
> The description of hbase.client.scanner.caching says:
> Number of rows that will be fetched when calling next
> on a scanner if it is not served from memory. Higher caching values
> will enable faster scanners but will eat up more memory and some
> calls of next may take longer and longer times when the cache is empty.
>
> That means that it will pre-fetch that number of rows, if the next row does
> not come from memory. So if your rows are small enough to fit 100 of them in
> one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it
> will only go to disk when it exhausts the whole block, which sticks in block
> cache. So, it will still fetch the same amount of data from disk every time.
> If you increase the number to a value that is certain to load multiple
> blocks at a time from disk, it will increase performance.
>
>
>

Re: scan performance improvement

Posted by Friso van Vollenhoven <fv...@xebia.com>.

Not that block size (that's the HDFS one), but the HBase block size. You set it at table creation or it uses the default of 64K.

The description of hbase.client.scanner.caching says:
Number of rows that will be fetched when calling next
on a scanner if it is not served from memory. Higher caching values
will enable faster scanners but will eat up more memory and some
calls of next may take longer and longer times when the cache is empty.

That means that it will pre-fetch that number of rows, if the next row does not come from memory. So if your rows are small enough to fit 100 of them in one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it will only go to disk when it exhausts the whole block, which sticks in block cache. So, it will still fetch the same amount of data from disk every time. If you increase the number to a value that is certain to load multiple blocks at a time from disk, it will increase performance.
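
The reasoning above can be sketched as simple arithmetic: how many rows fit in one 64K HBase block, and how many blocks one batch of pre-fetched rows covers. The function names are made up for illustration, the ~4k row size is the figure mentioned elsewhere in this thread, and the estimate ignores key/value overhead and compression.

```python
HBASE_BLOCK_SIZE = 64 * 1024  # default HBase block size (64K)


def rows_per_block(row_bytes, block_bytes=HBASE_BLOCK_SIZE):
    """Approximate number of rows that fit in one block (at least 1)."""
    return max(1, block_bytes // row_bytes)


def blocks_per_batch(caching, row_bytes, block_bytes=HBASE_BLOCK_SIZE):
    """Approximate number of blocks covered by one batch of `caching` rows."""
    batch_bytes = caching * row_bytes
    return (batch_bytes + block_bytes - 1) // block_bytes  # integer ceiling


ROW_BYTES = 4 * 1024  # ~4k per record, as mentioned in the thread

print(rows_per_block(ROW_BYTES))
for caching in (1, 50, 1000):
    print(caching, blocks_per_batch(caching, ROW_BYTES))
```

With much smaller rows, many more of them fit in a block, and a caching value of 50 may never reach past a single block.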

On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:

> Yes , I thought about large number , so you said it depends on block size.
> Good point.
> 
> I have one recored ~ 4k ,
> block size is:
> 
> <property>
>  <name>dfs.block.size</name>
>  <value>268435456</value>
>  <description>HDFS blocksize of 256MB for large file-systems.
> </description>
> </property>
> 
> what is the number that I have choose? Assuming
> I am afraid that using number which is equal one block brings to
> socketTimeOutException? Am I write?
> 
> Thanks Oleg.
> 
> 
> 
> 

Re: scan performance improvement

Posted by Oleg Ruchovets <or...@gmail.com>.

Yes, I thought about a larger number; as you said, it depends on the block size.
Good point.

One record is ~4k,
 and the block size is:

<property>
  <name>dfs.block.size</name>
  <value>268435456</value>
  <description>HDFS blocksize of 256MB for large file-systems.
</description>
</property>

What number should I choose?
I am afraid that using a number equal to one block would lead to a
SocketTimeoutException. Am I right?

Thanks Oleg.




On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
fvanvollenhoven@xebia.com> wrote:

> How small is small? If it is bytes, then setting the value to 50 is not so
> much different from 1, I suppose. If 50 rows fit in one block, it will just
> fetch one block whether the setting is 1 or 50. You might want to try a
> larger value. It should be fine if the records are small and you need them
> all on the client side anyway.
>
> It also depends on the block size, of course. When you only ever do full
> scans on a table and little random access, you might want to increase that.
>
> Friso
>
>
>
>

Re: scan performance improvement

Posted by Friso van Vollenhoven <fv...@xebia.com>.

How small is small? If it is bytes, then setting the value to 50 is not so much different from 1, I suppose. If 50 rows fit in one block, it will just fetch one block whether the setting is 1 or 50. You might want to try a larger value. It should be fine if the records are small and you need them all on the client side anyway.

It also depends on the block size, of course. When you only ever do full scans on a table and little random access, you might want to increase that.

Friso

On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:

> Hi ,
>   To improve client performance I  changed
> hbase.client.scanner.caching from 1 to 50.
> After running client with new value( hbase.client.scanner.caching from = 50
> ) it didn't improve execution time at all.
> 
> I have ~ 9 million small records.
> I have to do full scan  , so it brings all 9 million records to client .
> My assumption -- this change have to bring significant improvement , but it
> is not.
> 
> Additional Information.
> I scan table which has 100 regions
> 5 server
> 20 map
> 4  concurrent map
> Scan process takes 5.5 - 6 hours. As for me it is too much time? Am I write?
> and how can I improve it
> 
> 
> I changed the value in all hbase-site.xml files and restart hbase.
> 
> Any suggestions.