Posted to user@hbase.apache.org by "Qingyan(Evan) Liu" <qi...@gmail.com> on 2009/07/09 11:14:11 UTC

help! hbase rev-792389, scan speed is as slow as randomRead!

Dears,

I'm new to HBase. I just checked out HBase trunk rev-792389 and tested its
performance with org.apache.hadoop.hbase.PerformanceEvaluation (detailed
results are listed below). It's strange that scan is as slow as randomRead.
I haven't changed any configuration parameters in the xml files. Can anyone
help me tune the scan performance? Thanks a lot.

Hardware: HP Compaq nx6320, CPU Centrino Duo 2 GHz, 1 GB Memory
OS: Ubuntu Jaunty 9.04
HBase: Hadoop HDFS NameNode + DataNode + HBase master + ZooKeeper +
region server, all on localhost.
Test commands:
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --rows=50000
randomWrite 1
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --rows=50000 scan
1
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --rows=50000
randomRead 1
Test results:
randomWrite, time cost: 6858ms
scan, time cost: 18836ms
randomRead, time cost: 16829ms
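In rows per second, those timings work out roughly as follows (a quick throwaway calculation from the numbers above, not part of PerformanceEvaluation):

```java
// Rows/sec implied by the PerformanceEvaluation timings above (50,000 rows per run).
public class Throughput {
    static long rowsPerSec(int rows, long millis) {
        return Math.round(rows * 1000.0 / millis);
    }

    public static void main(String[] args) {
        System.out.println("randomWrite: " + rowsPerSec(50_000, 6858) + " rows/s");  // ~7291
        System.out.println("scan:        " + rowsPerSec(50_000, 18836) + " rows/s"); // ~2654
        System.out.println("randomRead:  " + rowsPerSec(50_000, 16829) + " rows/s"); // ~2971
    }
}
```

So per row, scan here is actually slightly slower than randomRead, not just equal to it.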

I would expect scan to be much faster than randomRead.
However, it isn't. Why?

sincerely,
Evan

Re: help! hbase rev-792389, scan speed is as slow as randomRead!

Posted by Ryan Rawson <ry...@gmail.com>.
On large map-reduce runs with small rows, I set scanner caching to
1000-3000 rows.  This seemingly minor change lets me reach 4.5M
row reads/sec (~40 bytes per row).  Without it, single-row fetching is
stupid slow.

I don't think we can pick one reasonable default here, for two reasons:
- for those doing 'webtable'-style workloads with heavy processing per
row, a large cache can cause scanner timeouts
- small rows need large prefetches to get good performance.

These two are at odds.  Right now we are optimized for the former,
since timeouts can cause m-r failures, whereas the latter is "merely" a
performance issue.
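The trade-off can be sketched with a back-of-envelope model. The RPC latency and per-row cost below are illustrative assumptions, not measurements:

```java
// Back-of-envelope model: total scan time ≈ round-trips × RPC latency + rows × per-row cost.
// The 0.3ms RPC and 0.005ms per-row figures are assumed for illustration only.
public class ScanModel {
    static double scanMillis(int rows, int caching, double rpcMs, double perRowMs) {
        int roundTrips = (rows + caching - 1) / caching; // one RPC per batch of `caching` rows
        return roundTrips * rpcMs + rows * perRowMs;
    }

    public static void main(String[] args) {
        int rows = 50_000;
        // caching=1: one RPC per row, so the round-trip term dominates
        System.out.printf("caching=1:    %.0f ms%n", scanMillis(rows, 1, 0.3, 0.005));
        // caching=1000: only 50 RPCs total, so the round-trip term nearly vanishes
        System.out.printf("caching=1000: %.0f ms%n", scanMillis(rows, 1000, 0.3, 0.005));
    }
}
```

Under these assumed numbers, caching=1 spends about 15 seconds on round trips alone, while caching=1000 brings the modelled total down to a few hundred milliseconds.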

Thanks for pointing this out!
-ryan

On Thu, Jul 9, 2009 at 7:35 PM, Qingyan(Evan) Liu<qi...@gmail.com> wrote:
> [quoted thread trimmed]

Re: help! hbase rev-792389, scan speed is as slow as randomRead!

Posted by "Qingyan(Evan) Liu" <qi...@gmail.com>.
Thanks a lot, JG!

I've just run svn update and tested the new code, which calls
setScannerCaching(30). Scan performance is now much better: 5460ms at
offset 0 for 100000000 rows.

So the conclusion is clear: turning on prefetching greatly boosts
the scan speed.

Thank you all.

sincerely,
Evan

2009/7/10 Jonathan Gray <jl...@streamy.com>
> [quoted message trimmed]

Re: help! hbase rev-792389, scan speed is as slow as randomRead!

Posted by Jonathan Gray <jl...@streamy.com>.
Not every test is created equal: different tests measure different 
things, and different environments/setups/configurations can yield 
different results.

I posted the utility (HBench) I used to generate the statistics in 
those slides to a JIRA.  You can grab it and try it out to see what 
you get:

https://issues.apache.org/jira/browse/HBASE-1501

However, the primary reason I think you're seeing significantly 
lower scan performance is that PerformanceEvaluation was (incorrectly) 
not prefetching any rows, so only a single row was returned on each 
round trip.  That ends up basically benchmarking RPC performance, which 
is not what we're after.

In the real world, if you know you want to scan a large number of rows, 
use HTable.setScannerCaching(N), where N is the number of rows 
to fetch on each round trip.  The scanner still works as before (you just 
keep calling next() on it), but it caches N results and serves them 
from that cache.
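A minimal sketch of that usage, against the client API as it stood on trunk at the time (the table and family names match the TestTable described elsewhere in this thread; treat this as illustrative, not compiled code):

```java
// Sketch only: client-side scanner caching with the HBase 0.20-era API.
HTable table = new HTable(new HBaseConfiguration(), "TestTable");
table.setScannerCaching(1000);                 // fetch 1000 rows per round trip

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("info"));
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result row : scanner) {                 // each next() is served from the client-side cache
    // process row
  }
} finally {
  scanner.close();
}
```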

We just fixed up PerformanceEvaluation on TRUNK, so if you update your 
checkout and retest you should see a good boost in performance.  Scanner 
caching is now set to 30 for PE.

When running tests on your own data, you should play with this parameter 
to see how it will affect your performance in your environment and with 
your data.


Regarding cache warm-up, it exists for two reasons.

First, the operating system caches I/O in available memory, so once you've 
read something off of HDFS there's a good chance it will still be cached 
the next time around.

Second, there is an internal read block cache.  This should help 
performance significantly when your blocks are available in the cache. 
The reason you did not see a performance boost in scans after warm-up 
is what I described above: the test was really measuring RPC performance, 
because with an open scanner over small rows, fetching each row (from a 
block already sitting in memory) takes well under a millisecond.


Thanks JD for fixing PE :)


Jonathan Gray


Qingyan(Evan) Liu wrote:
> [quoted message trimmed]

Re: help! hbase rev-792389, scan speed is as slow as randomRead!

Posted by "Qingyan(Evan) Liu" <qi...@gmail.com>.
Dear J-D,

Here are another two tests. I changed the order of the tests; before each
test, I restarted both HBase and Hadoop. All runs use 50,000 rows of 1 KB each.

(1) randomWrite-randomRead-randomRead-scan-scan-randomRead
7117ms-15966ms-16678ms-10429ms-10730ms-15641ms

(2) randomWrite-scan-scan-randomRead-randomRead-scan
6587ms-14671ms-12153ms-20264ms-18521ms-15619ms

From the above results, I think there are three major conclusions:
a) the "cache warm-up" phenomenon exists
b) the cache doesn't improve scan performance very much
c) scan performance is much lower than announced in these slides:
http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime
(on page 10, that deck reports a scan speed of 117ms per 10,000 rows, that
is, just 585ms per 50,000 rows. 585ms is much, much faster than my test
result of 10429ms!)

So, whatever effect cache warm-up has, I cannot tune the scan speed
up to the reported 117ms/10,000 rows. Either I'm doing something wrong,
or the report is.
I'm also curious about your test results for randomRead and scan. Could
you kindly share them with me? Thanks a lot!

P.S. TestTable attributes:
 TestTable <http://localhost:60010/table.jsp?name=TestTable>

   -  Parameters
      -   is_root: false
      -   is_meta: false
   -  Families
      -  Name: info
         -   bloomfilter: false
         -   compression: none
         -   versions: 3
         -   ttl: 2147483647
         -   blocksize: 65536
         -   in_memory: false
         -   blockcache: true


sincerely,
Evan

2009/7/9 Jean-Daniel Cryans <jd...@apache.org>

> [quoted message trimmed]
>

Re: help! hbase rev-792389, scan speed is as slow as randomRead!

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Evan,

The scan probably warmed the cache here. Do the same experiment with a
fresh HBase for the scan and the random reads.

J-D

On Thu, Jul 9, 2009 at 5:14 AM, Qingyan(Evan) Liu<qi...@gmail.com> wrote:
> [quoted message trimmed]