Posted to common-dev@hadoop.apache.org by James Kennedy <ja...@troove.net> on 2007/06/18 21:24:00 UTC

HBase Performance Q's

I've been profiling HBase with my own benchmark but I'm having trouble 
interpreting some of the results.  Maybe someone can help shed some light.

The benchmark:

* Test table consists of 10 columns, all of the same family. All values 
are a random selection from 10 city names.
* My read method scans all rows on 3 of those columns using HClient (a 
rough sketch of the scan loop follows this list).
* I run HMaster and one HRegion in separate JVMs (for more controlled 
profiling), both using local FS, not HDFS, all on a single machine.
* Each test is cold: run right after HRegion startup with nothing in 
HMemCache.
* hbase-site.xml is attached
* profile *.cpu files are large, I'll send them individually if requested
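For reference, the read method's scan loop is essentially the following. 
This is a minimal sketch against the hbase contrib HClient / 
HScannerInterface API as I understand it; the table and column names are 
placeholders rather than my actual benchmark names, and the signatures 
should be treated as approximate.

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HClient;
import org.apache.hadoop.hbase.HScannerInterface;
import org.apache.hadoop.hbase.HStoreKey;
import org.apache.hadoop.io.Text;

public class ScanBenchmark {
  public static void main(String[] args) throws IOException {
    HClient client = new HClient(new Configuration());
    client.openTable(new Text("test_table"));      // placeholder table name
    Text[] cols = {                                // placeholder column names
        new Text("family:col1"),
        new Text("family:col2"),
        new Text("family:col3") };
    // An empty start row means scan from the first row in the table.
    HScannerInterface scanner = client.obtainScanner(cols, new Text(""));
    try {
      HStoreKey key = new HStoreKey();
      TreeMap<Text, byte[]> row = new TreeMap<Text, byte[]>();
      while (scanner.next(key, row)) {   // one HRegionServer.next() RPC per row
        // consume the three column values here
        row.clear();
      }
    } finally {
      scanner.close();
    }
  }
}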

Profiling:
I'm using YourKit 6.0.15 to profile just the CPU time of HRegionServer 
running in its own JVM.

With a table of size 10k:
    - Scan took 5220ms.

With a table of size 100k:
    - Scan took 29366ms.

Observations:

Within the combined RPC$server.call() of all Handler threads, 
HRegionServer.next() took about 90% in both the 10K and 100K cases.  
This is as expected since it's called 10K/100K times. The biggest 
bottleneck below that is HStoreKey.extractFamily(), at 33%/39% of 
HRegionServer.next(), which makes it more expensive than all the IO 
operations in HStoreScanner.getNext() (31%/31%). Couldn't there be a 
cheaper way to extract the family from a column key?

Outside the RPC$server.call(),

Server$Connection.readAndProcess() takes about 17%/22% of the time that 
HRegionServer.next() takes, suggesting heavy overhead in RPC IO and 
unmarshalling.  Overhead in the other direction, ObjectWritable.write() 
+ DataOutputStream.flush(), takes 63%/57% of the time that 
HRegionServer.next() takes.  The input/output ratio makes sense since 
more data is returned than sent, but this seems like a lot of RPC overhead.

I may be incorrect, but my understanding of YourKit (latest build) is 
that it does not count the time that threads spend blocked. So it seems 
quite a mystery to me that the following methods consume so much CPU 
time (not counting methods called from within them):

Server$Handler.run()  68%
SelectorImpl.select() 7%

Relative to RPC$Server.call(), which is only 2%.

In previous profiles I've observed that Server$Listener.run() consumed 
most of the time instead of Server$Handler.run(). Both of those threads 
synchronize on the callQueue, and I'm sure that synchronization has 
something to do with this result.  Unless I'm wrong about my YourKit 
assumption, I don't see why sync blocking should show up like this. Does 
anyone know?

As for SelectorImpl.select(), I looked it up but don't completely 
understand what's happening under the hood. I know that it is blocking 
too while waiting for a connection but I don't see why it should be 
consuming so much CPU.

If anyone else has done YourKit profiling on HBase, please let me know 
if you've found similar results or can help me clarify my results.

Right now, if I assume that YourKit is counting block time and I ignore 
those numbers, I come to the following conclusions:

It looks like HStoreKey.extractFamily() is a bottleneck.  For 100k rows 
and 10 cols, it will be called over 1,000,000 times during a full table 
scan (even if just retrieving one col) and each call requires multiple 
byte[] <-> string conversions.
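For context, here is roughly what each extractFamily() call costs (a 
reconstruction from reading the code, not a verbatim copy):

import org.apache.hadoop.io.Text;

// Rough reconstruction of the current extractFamily(): every call decodes
// the whole column key to a String, searches it, takes a substring, and
// re-encodes that substring back to bytes inside new Text().
public static Text extractFamily(Text col) {
  String column = col.toString();               // byte[] -> String decode
  int colon = column.indexOf(':');              // find the family delimiter
  // (error handling for a missing ':', and whether the ':' itself is kept,
  // are details this sketch glosses over)
  return new Text(column.substring(0, colon));  // String -> byte[] re-encode
}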

RPC overhead can be quite high during a scan. Could batched/read-ahead 
scanning be useful to minimize RPC calls?
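To make the batching idea concrete, here is one possible shape for it. 
This interface is purely hypothetical, not anything that exists in hbase 
today:

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.Text;

// Hypothetical batched scanner: one RPC fetches up to 'limit' rows, so a
// 100k-row scan costs roughly 100k/limit round trips instead of 100k.
public interface BatchedScannerInterface {
  /**
   * Returns up to 'limit' rows, each as (row key -> (column -> value)).
   * An empty map means the scanner is exhausted.
   */
  TreeMap<Text, TreeMap<Text, byte[]>> next(int limit) throws IOException;

  void close() throws IOException;
}

With a limit of 100, the 100k-row scan above would drop from roughly 
100,000 next() round trips to roughly 1,000.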







Re: HBase Performance Q's

Posted by Michael Stack <st...@duboce.net>.
James Kennedy wrote:
> I've re-profiled with your recent updates.  Overall the benchmark 
> seems 4.1% faster averaged over several runs.  I can't say I noticed a 
> significant improvement in the readAndProcess()/ObjectWritable.write() 
> methods. I'll definitely add a vote to HADOOP-414.
I made no changes to readAndProcess nor ObjectWritable, just changes in 
hbase code. Profiling writing, the changes in HADOOP-1498 made it so 
hbase code no longer showed as a significant consumer of CPU nor as an 
allocator of objects (are there hbase CPU/Object-allocation hotspots 
showing in your reads profiling? Did HStoreKey.extractFamily 
improve?).  4% isn't much of an improvement.  For the big wins, it looks 
like we need to look more at the RPC'ing (or at how hbase is using the RPC).

St.Ack

Re: HBase Performance Q's

Posted by James Kennedy <ja...@troove.net>.
Ok St.Ack,

I've re-profiled with your recent updates.  Overall the benchmark seems 
4.1% faster averaged over several runs.  I can't say I noticed a 
significant improvement in the readAndProcess()/ObjectWritable.write() 
methods. I'll definitely add a vote to HADOOP-414.

I definitely notice that UTF8.write() takes up 78% of the overall write 
time, and accounts for 50% of the time that the 100,061 calls to 
HRegionServer.next() take.

UTF8.readString() takes up only 15%, with 
sun.nio.ch.SocketChannelImpl.read(ByteBuffer) (41%) being the biggest 
piece of the pie.


Michael Stack wrote:
> I recently spent some time profiling hbase writing.  I haven't yet 
> spent time on reading.  The patch in HADOOP-1498 has fixes that came 
> of the write profiling sessions.  After applying the patch, running 
> the http://wiki.apache.org/lucene-hadoop/Hbase/PerformanceEvaluation 
> write tests, we seem to be spending > 90% of allocations and CPU time 
> in RPC reading and writing instances of the deprecated UTF8 class (I 
> believe HADOOP-414 is the issue that covers removing UTF8 from the 
> RPC.  If you are seeing the same phenomenon, you might add a vote so 
> the issue gets pegged at a higher priority).  Does the HADOOP-1498 
> patch help with the performance profile you are seeing in hbase?
>
> More below...
>
> James Kennedy wrote:
>> ...
>> Observations:
>>
>> Within the combined RPC$server.call() of all Handler threads, 
>> HRegionServer.next() took about 90% in both the 10K and 100K cases.  
>> This is as expected since it's called 10K/100K times. The biggest 
>> bottleneck below that is HStoreKey.extractFamily(), at 33%/39% of 
>> HRegionServer.next(), which makes it more expensive than all the IO 
>> operations in HStoreScanner.getNext() (31%/31%). Couldn't there be a 
>> cheaper way to extract the family from a column key?
>
> The HADOOP-1498 patch redoes extractFamily so it uses Text#find and 
> then a byte array copy to produce the column family.  It used to do a 
> toString followed by an indexOf, substring, and then a new Text out of 
> the produced substring.
>
>> Outside the RPC$server.call(),
>>
>> Server$Connection.readAndProcess() takes about 17%/22% of the time 
>> that HRegionServer.next() takes, suggesting heavy overhead in RPC IO 
>> and unmarshalling.  Overhead in the other direction, 
>> ObjectWritable.write() + DataOutputStream.flush(), takes 63%/57% of 
>> the time that HRegionServer.next() takes.  The input/output ratio 
>> makes sense since more data is returned than sent, but this seems 
>> like a lot of RPC overhead.
>>
>
> Yes.  See my comment above on what I'm seeing writing.
>
> ...
>>
>> Right now, if I assume that YourKit is counting block time and I 
>> ignore those numbers, I come to the following conclusions:
>>
>> It looks like HStoreKey.extractFamily() is a bottleneck.  For 100k 
>> rows and 10 cols, it will be called over 1,000,000 times during a 
>> full table scan (even if just retrieving one col) and each call 
>> requires multiple byte[] <-> string conversions.
>>
> If you have suggestions for how to improve on the version in 
> HADOOP-1498, would love to hear them.
>
>> RPC overhead can be quite high during a scan. Could 
>> batched/read-ahead scanning be useful to minimize RPC calls?
>
> Moving from UTF8 to Text would help with time spent in RPC some going 
> by some primitive tests I ran locally.
>
> Batching, as per your suggestion in HADOOP-1468, would also help.
>
> St.Ack
>
>
>> ...


Re: HBase Performance Q's

Posted by Michael Stack <st...@duboce.net>.
I recently spent some time profiling hbase writing.  I haven't yet spent 
time on reading.  The patch in HADOOP-1498 has fixes that came of the 
write profiling sessions.  After applying the patch, running the 
http://wiki.apache.org/lucene-hadoop/Hbase/PerformanceEvaluation write 
tests, we seem to be spending > 90% of allocations and CPU time in RPC 
reading and writing instances of the deprecated UTF8 class (I believe 
HADOOP-414 is the issue that covers removing UTF8 from the RPC.  If you 
are seeing the same phenomenon, you might add a vote so the issue gets 
pegged at a higher priority).  Does the HADOOP-1498 patch help with the 
performance profile you are seeing in hbase?

More below...

James Kennedy wrote:
> ...
> Observations:
>
> Within the combined RPC$server.call() of all Handler threads, 
> HRegionServer.next() took about 90% in both the 10K and 100K cases.  
> This is as expected since it's called 10K/100K times. The biggest 
> bottleneck below that is HStoreKey.extractFamily(), at 33%/39% of 
> HRegionServer.next(), which makes it more expensive than all the IO 
> operations in HStoreScanner.getNext() (31%/31%). Couldn't there be a 
> cheaper way to extract the family from a column key?

The HADOOP-1498 patch redoes extractFamily so it uses Text#find and then 
a byte array copy to produce the column family.  It used to do a 
toString followed by an indexOf, substring, and then a new Text out of 
the produced substring.
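In code, the new version looks roughly like this. This is a sketch of 
the approach rather than the exact patch; error handling is elided, and 
I'm assuming the family keeps its trailing ':' as it does elsewhere in 
hbase:

import org.apache.hadoop.io.Text;

// Sketch of the HADOOP-1498 approach: find the ':' delimiter directly in
// the UTF-8 bytes and copy only the family prefix, avoiding the
// byte[] -> String -> byte[] round trip of the old implementation.
public static Text extractFamily(Text col) {
  int colon = col.find(":");                  // byte offset of the delimiter
  Text family = new Text();
  family.set(col.getBytes(), 0, colon + 1);   // copy family bytes, ':' included
  return family;
}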

> Outside the RPC$server.call(),
>
> Server$Connection.readAndProcess() takes about 17%/22% of the time 
> that HRegionServer.next() takes, suggesting heavy overhead in RPC IO 
> and unmarshalling.  Overhead in the other direction, 
> ObjectWritable.write() + DataOutputStream.flush(), takes 63%/57% of the 
> time that HRegionServer.next() takes.  The input/output ratio makes 
> sense since more data is returned than sent, but this seems like a lot 
> of RPC overhead.
>

Yes.  See my comment above on what I'm seeing writing.

...
>
> Right now, if I assume that YourKit is counting block time and I 
> ignore those numbers, I come to the following conclusions:
>
> It looks like HStoreKey.extractFamily() is a bottleneck.  For 100k 
> rows and 10 cols, it will be called over 1,000,000 times during a full 
> table scan (even if just retrieving one col) and each call requires 
> multiple byte[] <-> string conversions.
>
If you have suggestions for how to improve on the version in 
HADOOP-1498, would love to hear them.

> RPC overhead can be quite high during a scan. Could batched/read-ahead 
> scanning be useful to minimize RPC calls?

Going by some primitive tests I ran locally, moving from UTF8 to Text 
would help some with the time spent in RPC.
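By "primitive tests" I mean a micro-benchmark along these lines, which 
serializes the same value repeatedly through each class (the loop count 
and sample string are arbitrary):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.UTF8;

public class SerializationMicroBenchmark {
  public static void main(String[] args) throws IOException {
    String sample = "family:column";
    int iterations = 1000000;

    // Time the deprecated UTF8 path the RPC uses today: every call
    // re-encodes the String to bytes.
    ByteArrayOutputStream utf8Buf = new ByteArrayOutputStream();
    DataOutputStream utf8Out = new DataOutputStream(utf8Buf);
    long start = System.currentTimeMillis();
    for (int i = 0; i < iterations; i++) {
      UTF8.writeString(utf8Out, sample);
      utf8Buf.reset();
    }
    System.out.println("UTF8: " + (System.currentTimeMillis() - start) + "ms");

    // Time the Text replacement: the String is encoded once in the
    // constructor, and write() just copies the cached bytes.
    ByteArrayOutputStream textBuf = new ByteArrayOutputStream();
    DataOutputStream textOut = new DataOutputStream(textBuf);
    Text text = new Text(sample);
    start = System.currentTimeMillis();
    for (int i = 0; i < iterations; i++) {
      text.write(textOut);
      textBuf.reset();
    }
    System.out.println("Text: " + (System.currentTimeMillis() - start) + "ms");
  }
}

That Text encodes the String once up front is part of why it wins in a 
tight scan loop like ours.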

Batching, as per your suggestion in HADOOP-1468, would also help.

St.Ack



>
>
>
>
> ------------------------------------------------------------------------
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>     <property>
>         <name>hbase.master</name>
>         <value>172.30.20.15:60000</value>
>         <description>The host and port that the HBase master runs at.
>             TODO: Support 'local' (All running in single context).
>         </description>
>     </property>
>     <property>
>         <name>hbase.regionserver</name>
>         <value>172.30.20.15:60010</value>
>         <description>The host and port a HBase region server runs at.
>         </description>
>     </property>
>     <property>
>         <name>hbase.regiondir</name>
>         <value>hbase</value>
>         <description>The directory shared by region servers.
>         </description>
>     </property>
>     <property>
>         <name>hbase.hregion.max.filesize</name>
>         <value>134217728</value>
>         <description>
>             Maximum desired file size for an HRegion.  If filesize exceeds
>             value + (value / 2), the HRegion is split in two.  Default: 128M (134217728).
>         </description>
>     </property>
>     <property>
>         <name>hbase.hregion.compactionThreshold</name>
>         <value>10</value>
>         <description>
>             By default, we compact the region if an HStore has more than 10 map files.
>         </description>
>     </property>
>     <property>
>         <name>hbase.regionserver.thread.splitcompactcheckfrequency</name>
>         <value>40000</value>
>         <description>
>             Default: 60 * 1000
>         </description>
>     </property>
>     
>     <property>
>         <name>hbase.client.pause</name>
>         <value>2000</value>
>     </property>
>     <property>
>         <name>hbase.client.timeout.number</name>
>         <value>30000</value>
>         <description>Try this many timeouts before giving up.
>         </description>
>     </property>
>     <property>
>         <name>hbase.client.retries.number</name>
>         <value>15</value>
>         <description>Count of maximum retries fetching the root region from root
>             region server.
>         </description>
>     </property>
>     <property>
>         <name>hbase.master.meta.thread.rescanfrequency</name>
>         <value>60000</value>
>         <description>How long the HMaster sleeps (in milliseconds) between scans of
>             the root and meta tables.
>         </description>
>     </property>
>     <property>
>         <name>hbase.master.lease.period</name>
>         <value>30000</value>
>         <description>HMaster server lease period in milliseconds. Default is
>             30 seconds.</description>
>     </property>
>     <property>
>         <name>hbase.server.thread.wakefrequency</name>
>         <value>5000</value>
>         <description>Time to sleep in between searches for work (in milliseconds).
>             Used as sleep interval by service threads such as META scanner and log roller.
>         </description>
>     </property>
>     <property>
>         <name>hbase.regionserver.lease.period</name>
>         <value>30000</value>
>         <description>HRegion server lease period in milliseconds. Default is
>             30 seconds.</description>
>     </property>
>     <property>
>         <name>hbase.regionserver.handler.count</name>
>         <value>10</value>
>         <description>Count of RPC Server instances spun up on RegionServers
>             Same property is used by the HMaster for count of master handlers.
>             Default is 10.
>         </description>
>     </property>
>     <property>
>         <name>hbase.regionserver.msginterval</name>
>         <value>500</value>
>         <description>Interval between messages from the RegionServer to HMaster
>             in milliseconds.  Default is 15 sec. Set this value low if you want unit
>             tests to be responsive.
>         </description>
>     </property>
>     
> </configuration>
>