Posted to user@hbase.apache.org by Khaled Elmeleegy <kd...@hotmail.com> on 2014/10/02 09:12:23 UTC

HBase read performance

Hi,

I am trying to do a scatter/gather on HBase (0.98.6.1), where I have a client reading ~1000 keys from an HBase table. These keys happen to fall on the same region server. For my reads I use a reverse scan to read each key, as I want the key prior to a specific timestamp (timestamps are stored in reverse order). I don't believe gets can accomplish that, right? So I use a scan, with caching set to 1.
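
(To make the access pattern concrete, here is a minimal sketch of one such lookup against the 0.98 client API; the table name "delta" matches the table mentioned below, but the key layout and the helper name are only illustrative.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // One "reverse get": start a reversed scan at the given row key and keep
    // only the first result, i.e. the entry just before the requested timestamp.
    static Result reverseGet(Configuration conf, byte[] startRow) throws IOException {
        HTable table = new HTable(conf, "delta");
        try {
            Scan scan = new Scan(startRow);
            scan.setReversed(true);   // walk backwards in row-key order
            scan.setCaching(1);       // only a single row is wanted
            ResultScanner scanner = table.getScanner(scan);
            try {
                return scanner.next();
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }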

I use 2000 reader threads in the client, and on HBase I've set hbase.regionserver.handler.count to 1500. With this setup, my scatter/gather is very slow and can take up to 10s in total. Timing an individual getScanner(..) call on the client side, it can easily take a few hundred ms. I also got the following metrics from the region server in question:

"queueCallTime_mean" : 2.190855525775637,
"queueCallTime_median" : 0.0,
"queueCallTime_75th_percentile" : 0.0,
"queueCallTime_95th_percentile" : 1.0,
"queueCallTime_99th_percentile" : 556.9799999999818,

"processCallTime_min" : 0,
"processCallTime_max" : 12755,
"processCallTime_mean" : 105.64873440912682,
"processCallTime_median" : 0.0,
"processCallTime_75th_percentile" : 2.0,
"processCallTime_95th_percentile" : 7917.95,
"processCallTime_99th_percentile" : 8876.89,

"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,

Where "delta" is the name of the table I am querying.

In addition to all this, I monitored the hardware resources (CPU, disk, and network) of both the client and the region server, and nothing seems anywhere near saturation. So I am puzzled by what's going on and where this time is going.

A few things to note based on the above measurements: the medians of both IPC processCallTime and queueCallTime are basically zero (ms, I presume, right?). However, scanNext_median is 101 (ms too, right?). I am not sure how this adds up. Also, even though the 101 figure seems outrageously high and I don't know why, all these scans should still be happening in parallel, so the overall call should finish fast given that no hardware resource is contended, right? But this is not what's happening, so I have to be missing something(s).
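
(For reference, the client-side scatter/gather amounts to roughly the sketch below, assuming the reverseGet-style lookup sketched earlier; the thread count is the one described above, while conf and keys are assumed to be in scope.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Fire all ~1000 lookups in parallel and wait for them; with no saturated
    // hardware resource, total time should track the slowest individual call.
    ExecutorService pool = Executors.newFixedThreadPool(2000);
    List<Future<Result>> futures = new ArrayList<Future<Result>>();
    for (final byte[] key : keys) {                      // the ~1000 row keys
        futures.add(pool.submit(new Callable<Result>() {
            public Result call() throws Exception {
                return reverseGet(conf, key);            // reversed scan, caching=1
            }
        }));
    }
    for (Future<Result> f : futures) {
        f.get();                                         // gather
    }
    pool.shutdown();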

So, any help is appreciated there.

Thanks,
Khaled


Re: HBase read performance

Posted by Stack <st...@duboce.net>.
On Mon, Oct 13, 2014 at 12:13 PM, Khaled Elmeleegy <kd...@hotmail.com>
wrote:

> I went through a lengthy testing process with Ted.


Offlist?



> I tried both disabling just cache on write and both COW and BC. Disabling
> them makes the problem go away.



Disabling both made it go away?  Running w/ BC and COW off did not produce
the issue?  That is good to know.



> I think disabling COW in my workload effectively doesn't exercise  the
> block cache as my workload is mostly writes, so disabling COW doesn't cause
> the cache blocks to churn.
>

If mostly writes, COW probably has little noticeable effect.



> I tried both overcommitting memory and not. This is irrelevant though. The
> cache operates on the virtual memory offered to it and it's agnostic to the
> physical memory size. Memory over commitment may only hurt performance due
> to swapping by the OS, but this has nothing to do with the bug.
>


A swapping java program is effectively dead. Is this what you observed?

Thanks for doing the bit of debug.

Helps,
St.Ack



> Khaled
>
> > Date: Sun, 12 Oct 2014 23:17:24 -0700
> > Subject: Re: HBase read performance
> > From: stack@duboce.net
> > To: user@hbase.apache.org
> >
> > On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> >
> > > It goes away, but i think that's because i am basically churning the
> cache
> > > far less. My workload is mostly writes.
> > >
> > >
> > What exactly did you change?
> >
> > Did you disable both BC and cacheonwrite or just one of these configs?
> >
> > Disabling BC and/or cacheonwrite would churn your cache more?
> >
> > What about the heap config?  Did you over commit mem?  If so, did you try
> > doing a direct memory sizing that was w/i the bounds of physical memory
> > (well within)?  Did you still see exceptions?
> >
> > St.Ack
> >
> >
> >
> >
> > > > Date: Sun, 12 Oct 2014 19:43:16 -0700
> > > > Subject: Re: HBase read performance
> > > > From: yuzhihong@gmail.com
> > > > To: user@hbase.apache.org
> > > >
> > > > Hi,
> > > > Can you turn off cacheonwrite and keep bucket cache to see if the
> problem
> > > > goes away ?
> > > >
> > > > Cheers
> > > >
> > > > On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > wrote:
> > > >
> > > > > Yes, I can reproduce it with some work.
> > > > > The workload is basically as follows:
> > > > > There are writers streaming writes to a table. Then, there is a reader
> > > > > (invoked via a web interface). The reader does 1000 parallel reverse
> > > > > scans, all of which end up hitting the same region in my case. The scans
> > > > > are effectively "gets" as I just need to get one record off of each of
> > > > > them. I just need to do a "reverse" get, which is not supported (would be
> > > > > great to have :)), so I do it via reverse scan. After a few tries, the
> > > > > reader consistently hits this bug.
> > > > >
> > > > > This happens with these config changes:
> > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > >   -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > hbase-site: hbase.bucketcache.ioengine=offheap
> > > > >   hbase.bucketcache.size=4196
> > > > >   hbase.rs.cacheblocksonwrite=true
> > > > >   hfile.block.index.cacheonwrite=true
> > > > >   hfile.block.bloom.cacheonwrite=true
> > > > >
> > > > > Interestingly, without these config changes, I can't reproduce the problem.
> > > > > Khaled
> > > > >
> > > > >
> > > > > > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > > > > > Subject: Re: HBase read performance
> > > > > > From: stack@duboce.net
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > It looks like we are messing up our positioning math
> > > > > > (java/nio/Buffer.java lines 233-235,
> > > > > > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233):
> > > > > >
> > > > > >     public final Buffer position(int newPosition) {
> > > > > >         if ((newPosition > limit) || (newPosition < 0))
> > > > > >             throw new IllegalArgumentException();
> > > > > >
> > > > > >
> > > > > > Is it easy to reproduce, Khaled?  Always the same region/store, or
> > > > > > spread across all reads?
> > > > > >
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <
> kdiaa@hotmail.com
> > > >
> > > > > wrote:
> > > > > >
> > > > > > > Andrew thanks. I think this indeed was the problem.
> > > > > > > To get over it, I increased the amount of memory given to the region
> > > > > > > server to avoid IO on reads. I used the configs below. In my experiments,
> > > > > > > I have writers streaming their output to HBase. The reader powers a web
> > > > > > > page and does this scatter/gather, where it reads the 1000 keys written
> > > > > > > last and passes them to the front end. With this workload, I get the
> > > > > > > exception below at the region server. Again, I am using HBase (0.98.6.1).
> > > > > > > Any help is appreciated.
> > > > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > > > >   -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > > > hbase-site: hbase.bucketcache.ioengine=offheap
> > > > > > >   hfile.block.cache.size=0.4
> > > > > > >   hbase.bucketcache.size=4196
> > > > > > >   hbase.rs.cacheblocksonwrite=true
> > > > > > >   hfile.block.index.cacheonwrite=true
> > > > > > >   hfile.block.bloom.cacheonwrite=true
> > > > > > >
> > > > > > > 2014-10-10 15:06:44,173 ERROR
> > > > > > > [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer:
> > > > > > > Unexpected throwable object
> > > > > > > java.lang.IllegalArgumentException
> > > > > > >         at java.nio.Buffer.position(Buffer.java:236)
> > > > > > >         at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > > > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > > > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > > > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > > > > > >         at java.lang.Thread.run(Thread.java:744)
> > > > > > > > From: apurtell@apache.org
> > > > > > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > > > > > Subject: Re: HBase read performance
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > > The cluster has 2 m1.large nodes.
> > > > > > > >
> > > > > > > > That's the problem right there.
> > > > > > > >
> > > > > > > > You need to look at c3.4xlarge or i2 instances as a minimum
> > > > > > > > requirement. M1 and even M3 instance types have ridiculously poor IO.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
> > > kdiaa@hotmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks Nicolas, Qiang.
> > > > > > > > >
> > > > > > > > > I was able to write a simple program that reproduces the problem
> > > > > > > > > on a tiny HBase cluster on ec2. The cluster has 2 m1.large nodes.
> > > > > > > > > One node runs the master, name node and zookeeper. The other node
> > > > > > > > > runs a data node and a region server, with heap size configured to
> > > > > > > > > be 6GB. There, the 1000 parallel reverse gets (reverse scans) take
> > > > > > > > > 7-8 seconds. The data set is tiny (10M records, each having a small
> > > > > > > > > number of bytes). As I said before, all hardware resources are very
> > > > > > > > > idle there.
> > > > > > > > >
> > > > > > > > > Interestingly, running the same workload on my macbook, the 1000
> > > > > > > > > parallel gets take ~200ms on a pseudo-distributed installation.
> > > > > > > > >
> > > > > > > > > Any help to resolve this mystery is highly appreciated.
> > > > > > > > >
> > > > > > > > > P.S. please find my test program attached.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Khaled
> > > > > > > > >
> > > > > > > > > > From: nkeywal@gmail.com
> > > > > > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I haven't seen it mentioned, but if I understand correctly each
> > > > > > > > > > scan returns a single row? If so you should use Scan#setSmall to
> > > > > > > > > > save some rpc calls.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Nicolas
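
(A minimal sketch of the Scan#setSmall suggestion above, against the 0.98 API; whether small scans combine with reversed scans in this version is worth verifying, so treat this as a starting point rather than a tested fix. startRow and table are assumed to be set up as in the earlier sketch.)

    // Same single-row reversed lookup as before, but flagged as a "small" scan
    // so the open/next/close can be folded into fewer RPC round trips.
    Scan scan = new Scan(startRow);
    scan.setReversed(true);
    scan.setCaching(1);
    scan.setSmall(true);
    ResultScanner scanner = table.getScanner(scan);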
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
> > > tianq01@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > when using separate HConnection instance, both its
> > > > > > > > > > > RpcClient instance(maintain connection to a
> regionserver)
> > > and
> > > > > > > Registry
> > > > > > > > > > > instance(maintain connection to zookeeper) will be
> > > separate..
> > > > > > > > > > >
> > > > > > > > > > > see
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > > > > > kdiaa@hotmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I tried creating my own HConnections pool to use for my HBase calls,
> > > > > > > > > > > > so that not all the (2K) threads share the same HConnection. However,
> > > > > > > > > > > > I could only have 10 HConnections. Beyond that I get ZK exceptions,
> > > > > > > > > > > > please find it below. Also, with 10 HConnections, I don't see
> > > > > > > > > > > > noticeable improvement in performance so far.
> > > > > > > > > > > >
> > > > > > > > > > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > 2014-10-05 06:11:26,490 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
> > > > > > > > > > > > 2014-10-05 06:11:27,845 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > 2014-10-05 06:11:27,849 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
> > > > > > > > > > > > 2014-10-05 06:11:30,405 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > 2014-10-05 06:11:30,405 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
> > > > > > > > > > > > 2014-10-05 06:11:35,278 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > 2014-10-05 06:11:35,279 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
> > > > > > > > > > > > 2014-10-05 06:11:44,393 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > 2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
> > > > > > > > > > > > 2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> > > > > > > > > > > > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > > > > >         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > > > > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > > > > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
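
(A minimal sketch of the connection pool being described, using the 0.98 HConnectionManager API; the pool size of 10 mirrors the limit hit above, the table name is illustrative, and threadId is assumed to be each reader thread's index. As noted in the thread, each HConnection keeps its own RPC client and ZooKeeper connection, which is why ZooKeeper's connection limit comes into play.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;

    // A small pool of independent HConnections; reader threads pick one
    // (e.g. round-robin) instead of all sharing a single connection.
    Configuration conf = HBaseConfiguration.create();
    HConnection[] connections = new HConnection[10];
    for (int i = 0; i < connections.length; i++) {
        connections[i] = HConnectionManager.createConnection(conf);
    }
    // per reader thread:
    HTableInterface table = connections[threadId % connections.length].getTable("delta");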
> > > > > > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > > > > > Thanks for all the input.
> > > > > > > > > > > > > Qiang: yes, all the threads are in the same client process,
> > > > > > > > > > > > > sharing the same connection. And since I don't see hardware
> > > > > > > > > > > > > contention, maybe there is contention over this code path. I'll
> > > > > > > > > > > > > try using many connections, see if it alleviates the problem,
> > > > > > > > > > > > > and report back.
> > > > > > > > > > > > > Thanks again,
> > > > > > > > > > > > > Khaled
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > > > > > >
> > > > > > > > >
> > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > > > > > months
> > > > > > > > > > > > > > ago.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > processCallTime comes from RpcServer#call, so it
> > > looks
> > > > > good?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have a suspect:
> > > > > > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > > > > > > if only 1 process, those threads will share just 1 connection to
> > > > > > > > > > > > > > that regionserver, there might be big contention on the RPC code
> > > > > > > > > > > > > > path --- for such a case, could you try using different connections?
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> > > > > yuzhihong@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Khaled:
> > > > > > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Otherwise capturing stack trace during the
> period
> > > of
> > > > > > > reverse
> > > > > > > > > scan
> > > > > > > > > > > > would
> > > > > > > > > > > > > > > help.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > > > > > > larsh@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You might have the data in the OS buffer
> cache,
> > > > > without
> > > > > > > short
> > > > > > > > > > > > circuit
> > > > > > > > > > > > > > > > reading the region server has to request the
> > > block
> > > > > from
> > > > > > > the
> > > > > > > > > data
> > > > > > > > > > > > node
> > > > > > > > > > > > > > > > process, which then reads it from the block
> > > cache.
> > > > > > > > > > > > > > > > That is a few context switches per RPC that
> do
> > > not
> > > > > show
> > > > > > > up
> > > > > > > > > in CPU
> > > > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If - as you say - you see a lot of evicted
> > > blocks the
> > > > > > > data
> > > > > > > > > *has*
> > > > > > > > > > > > to come
> > > > > > > > > > > > > > > > from the OS. If you do not see disk IO is
> *has*
> > > to
> > > > > come
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > OS
> > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > > I.e. there's more RAM on your boxes, and you
> > > should
> > > > > > > increase
> > > > > > > > > the
> > > > > > > > > > > > heap
> > > > > > > > > > > > > > > block
> > > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You can measure the context switches with
> vmstat.
> > > > > Other
> > > > > > > than
> > > > > > > > > that
> > > > > > > > > > > > I have
> > > > > > > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > > > > > > Also check the data locality index of the
> region
> > > > > server
> > > > > > > it
> > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > close
> > > > > > > > > > > > > > > > to 100%.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > > user@hbase.apache.org>
> > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll
> make
> > > sure
> > > > > I
> > > > > > > cover
> > > > > > > > > all
> > > > > > > > > > > > of them
> > > > > > > > > > > > > > > > and get back to you. I am not sure they are
> the
> > > > > > > bottleneck
> > > > > > > > > though
> > > > > > > > > > > > as they
> > > > > > > > > > > > > > > > all are about optimizing physical resource
> > > usage. As
> > > > > I
> > > > > > > said,
> > > > > > > > > I
> > > > > > > > > > > > don't see
> > > > > > > > > > > > > > > > any contended physical resources now. I'll
> also
> > > try
> > > > > to
> > > > > > > > > reproduce
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > > problem in a simpler environment and pass to
> you
> > > the
> > > > > test
> > > > > > > > > program
> > > > > > > > > > > > to play
> > > > > > > > > > > > > > > > with.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Couple of high level points to make. You are
> > > right
> > > > > that
> > > > > > > my
> > > > > > > > > use
> > > > > > > > > > > > case is
> > > > > > > > > > > > > > > > kind of a worst case for HBase reads. But, if
> > > things
> > > > > go
> > > > > > > the
> > > > > > > > > way
> > > > > > > > > > > you
> > > > > > > > > > > > > > > > described them, there should be tons of disk
> IO
> > > and
> > > > > that
> > > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > > > > clearly
> > > > > > > > > > > > > > > > the bottleneck. This is not the case though.
> > > That's
> > > > > for
> > > > > > > the
> > > > > > > > > > > simple
> > > > > > > > > > > > reason
> > > > > > > > > > > > > > > > that this is done in a test environment (I am
> > > still
> > > > > > > > > prototyping),
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > not a
> > > > > > > > > > > > > > > > lot of data is yet written to HBase. However
> for
> > > the
> > > > > > > real use
> > > > > > > > > > > > case, there
> > > > > > > > > > > > > > > > should writers constantly writing data to
> HBase
> > > and
> > > > > > > readers
> > > > > > > > > > > > occasionally
> > > > > > > > > > > > > > > > doing this scatter/gather. At steady state,
> > > things
> > > > > should
> > > > > > > > > only
> > > > > > > > > > > get
> > > > > > > > > > > > worse
> > > > > > > > > > > > > > > > and all the issues you mentioned should get
> far
> > > more
> > > > > > > > > pronounced.
> > > > > > > > > > > > At this
> > > > > > > > > > > > > > > > point, one can try to mitigate it using more
> > > memory
> > > > > or
> > > > > > > so. I
> > > > > > > > > am
> > > > > > > > > > > > not there
> > > > > > > > > > > > > > > > yet as I think I am hitting some software
> > > bottleneck,
> > > > > > > which I
> > > > > > > > > > > > don't know
> > > > > > > > > > > > > > > > how to work around.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > > > > > Any chance that you can provide a minimal
> test
> > > > > program
> > > > > > > and
> > > > > > > > > > > > instruction
> > > > > > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > One thing to note is that with scanning HBase cannot use bloom
> > > > > > > > > > > > > > > > > filters to rule out HFiles ahead of time; it needs to look into
> > > > > > > > > > > > > > > > > all of them.
> > > > > > > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > A few more questions/comments:
> > > > > > > > > > > > > > > > > - Do you have short circuit reading enabled? If not, you should.
> > > > > > > > > > > > > > > > > - Is your table major compacted? That will reduce the number of
> > > > > > > > > > > > > > > > >   files to look at.
> > > > > > > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It is
> > > > > > > > > > > > > > > > >   disabled by default in HBase, but not necessarily in your
> > > > > > > > > > > > > > > > >   install of HDFS.
> > > > > > > > > > > > > > > > > - Which version of HDFS are you using as the backing filesystem?
> > > > > > > > > > > > > > > > > - If your disk is idle, it means the data fits into the OS buffer
> > > > > > > > > > > > > > > > >   cache. In turn that means that you should increase the heap for
> > > > > > > > > > > > > > > > >   the region servers. You can also use block encoding (FAST_DIFF)
> > > > > > > > > > > > > > > > >   to try to make sure the entire working set fits into the cache.
> > > > > > > > > > > > > > > > > - Also try to reduce the block size - although if your overall
> > > > > > > > > > > > > > > > >   working set does not fit in the heap it won't help much.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is a good section of the book to read through generally
> > > > > > > > > > > > > > > > > (even though you might know most of this already):
> > > > > > > > > > > > > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -- Lars
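
(A small sketch of the block-encoding and block-size suggestions above, using the 0.98 HColumnDescriptor API; the family name and sizes are only examples, and whether they help depends on the working set.)

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    // Smaller blocks plus FAST_DIFF encoding let more of the working set
    // fit in the block cache, at the cost of some CPU.
    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
    family.setBlocksize(8 * 1024);   // down from the 64k default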
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > > user@hbase.apache.org
> > > > > >
> > > > > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I do see a very brief spike in CPU
> > > (user/system),
> > > > > but
> > > > > > > it's
> > > > > > > > > no
> > > > > > > > > > > > where
> > > > > > > > > > > > > > > near
> > > > > > > > > > > > > > > > 0% idle. It goes from 99% idle down to
> something
> > > > > like 40%
> > > > > > > > > idle
> > > > > > > > > > > for
> > > > > > > > > > > > a
> > > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > > or so. The thing to note, this is all on a
> test
> > > > > cluster,
> > > > > > > so
> > > > > > > > > no
> > > > > > > > > > > > real load.
> > > > > > > > > > > > > > > > Things are generally idle until i issue 2-3
> of
> > > these
> > > > > > > > > > > > multi-scan-requests
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > render a web page. Then, you see the spike
> in the
> > > > > cpu and
> > > > > > > > > some
> > > > > > > > > > > > activity
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > the network and disk, but nowhere near
> > > saturation.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If there are specific tests you'd like me
> to
> > > do to
> > > > > > > debug
> > > > > > > > > this,
> > > > > > > > > > > > I'd be
> > > > > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> I still think you're waiting on disk. No
> > > IOWAIT?
> > > > > So
> > > > > > > CPU
> > > > > > > > > is not
> > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> If you see a lot of evicted block then
> each
> > > RPC
> > > > > has a
> > > > > > > high
> > > > > > > > > > > > chance of
> > > > > > > > > > > > > > > > requiring to bring an entire 64k block in.
> You'll
> > > > > see bad
> > > > > > > > > > > > performance
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> We might need to trace this specific
> scenario.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > > > > >> From: Khaled Elmeleegy <kdiaa@hotmail.com
> >
> > > > > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > > > > user@hbase.apache.org>
> > > > > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> I've set the heap size to 6GB and I do
> have gc
> > > > > > > logging. No
> > > > > > > > > > > long
> > > > > > > > > > > > pauses
> > > > > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Other than the discrepancy between what's
> > > > > reported on
> > > > > > > the
> > > > > > > > > > > > client and
> > > > > > > > > > > > > > > > what's reported at the RS, there is also the
> > > issue
> > > > > of not
> > > > > > > > > getting
> > > > > > > > > > > > proper
> > > > > > > > > > > > > > > > concurrency. So, even if a reverse get takes
> > > 100ms
> > > > > or so
> > > > > > > > > (this
> > > > > > > > > > > has
> > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > mostly blocking on various things as no
> physical
> > > > > > > resource is
> > > > > > > > > > > > contended),
> > > > > > > > > > > > > > > > then the other gets/scans should be able to
> > > proceed
> > > > > in
> > > > > > > > > parallel,
> > > > > > > > > > > > so a
> > > > > > > > > > > > > > > > thousand concurrent gets/scans should finish
> in
> > > few
> > > > > > > hundreds
> > > > > > > > > of
> > > > > > > > > > > ms
> > > > > > > > > > > > not
> > > > > > > > > > > > > > > many
> > > > > > > > > > > > > > > > seconds. That's why I thought I'd increase
> the
> > > > > handlers
> > > > > > > > > count to
> > > > > > > > > > > > try to
> > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > more concurrency, but it didn't help. So,
> there
> > > must
> > > > > be
> > > > > > > > > something
> > > > > > > > > > > > else.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> Do check again on the heap size of the
> region
> > > > > > > servers.
> > > > > > > > > The
> > > > > > > > > > > > default
> > > > > > > > > > > > > > > > >>> unconfigured size is 1G; too small for
> much
> > > of
> > > > > > > anything.
> > > > > > > > > > > Check
> > > > > > > > > > > > your
> > > > > > > > > > > > > > > RS
> > > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > >>> -- look for lines produced by the
> > > JVMPauseMonitor
> > > > > > > thread.
> > > > > > > > > > > They
> > > > > > > > > > > > > > > usually
> > > > > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > > > > process-freeze
> > > > > > > > > events.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> Get is implemented as a Scan of a single
> > > row, so
> > > > > a
> > > > > > > > > reverse
> > > > > > > > > > > > scan of a
> > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> In practice, I have seen discrepancy
> between
> > > the
> > > > > > > > > latencies
> > > > > > > > > > > > reported
> > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >>> RS and the latencies experienced by the
> > > client.
> > > > > I've
> > > > > > > not
> > > > > > > > > > > > investigated
> > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > > > > Elmeleegy <
> > > > > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> Yes performance is similar with less
> > > handlers (I
> > > > > > > tried
> > > > > > > > > with
> > > > > > > > > > > > 100
> > > > > > > > > > > > > > > > first).
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The
> > > working
> > > > > set
> > > > > > > > > doesn't
> > > > > > > > > > > > seem to
> > > > > > > > > > > > > > > > fit in
> > > > > > > > > > > > > > > > >>>> memory as there are many cache misses.
> > > However,
> > > > > > > disk is
> > > > > > > > > far
> > > > > > > > > > > > from
> > > > > > > > > > > > > > > > being a
> > > > > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I
> also
> > > > > verified
> > > > > > > that
> > > > > > > > > > > > neither the
> > > > > > > > > > > > > > > > >>>> network nor the CPU of the region
> server or
> > > the
> > > > > > > client
> > > > > > > > > are a
> > > > > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > > > > >>>> This leads me to believe that likely
> this
> > > is a
> > > > > > > software
> > > > > > > > > > > > bottleneck,
> > > > > > > > > > > > > > > > >>>> possibly due to a misconfiguration on my
> > > side. I
> > > > > > > just
> > > > > > > > > don't
> > > > > > > > > > > > know how
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is
> the
> > > > > individual
> > > > > > > > > request
> > > > > > > > > > > > latency
> > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > >>>> reported by metrics on the region server
> > > (IPC
> > > > > > > > > > > processCallTime
> > > > > > > > > > > > vs
> > > > > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > > > > >>>> vs what's measured on the client. Does
> this
> > > > > sound
> > > > > > > > > right? Any
> > > > > > > > > > > > ideas
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> About this trick with the timestamps to
> be
> > > able
> > > > > to
> > > > > > > do a
> > > > > > > > > > > > forward
> > > > > > > > > > > > > > > scan,
> > > > > > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I
> am
> > > > > aware of
> > > > > > > it.
> > > > > > > > > The
> > > > > > > > > > > > problem
> > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > >>>> is, sometimes I want to get the key
> after a
> > > > > > > particular
> > > > > > > > > > > > timestamp and
> > > > > > > > > > > > > > > > >>>> sometimes I want to get the key before,
> so
> > > just
> > > > > > > relying
> > > > > > > > > on
> > > > > > > > > > > > the key
> > > > > > > > > > > > > > > > order
> > > > > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse
> > > get(). I
> > > > > > > thought
> > > > > > > > > > > > reverse
> > > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500
> > > handler
> > > > > > > threads
> > > > > > > > > > > > seems to
> > > > > > > > > > > > > > > be a
> > > > > > > > > > > > > > > > >>>> lot. Typically a good number of threads
> > > depends
> > > > > on
> > > > > > > the
> > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > (number of
> > > > > > > > > > > > > > > > >>>> cores, number of spindles, etc). I
> cannot
> > > think
> > > > > of
> > > > > > > any
> > > > > > > > > type
> > > > > > > > > > > of
> > > > > > > > > > > > > > > > scenario
> > > > > > > > > > > > > > > > >>>> where more than 100 would give any
> > > improvement.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> How large is the payload per KV
> retrieved
> > > that
> > > > > > > way? If
> > > > > > > > > > > large
> > > > > > > > > > > > (as
> > > > > > > > > > > > > > > in a
> > > > > > > > > > > > > > > > >>>> few 100k) you definitely want to lower
> the
> > > > > number
> > > > > > > of the
> > > > > > > > > > > > handler
> > > > > > > > > > > > > > > > threads.
> > > > > > > > > > > > > > > > >>>>> How much heap do you give the region
> > > server?
> > > > > Does
> > > > > > > the
> > > > > > > > > > > > working set
> > > > > > > > > > > > > > > fit
> > > > > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics,
> do you
> > > > > see the
> > > > > > > > > > > eviction
> > > > > > > > > > > > count
> > > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > > >>>> up, if so it does not fit into the
> cache).
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> If the working set does not fit into
> the
> > > cache
> > > > > > > > > (eviction
> > > > > > > > > > > > count goes
> > > > > > > > > > > > > > > > up)
> > > > > > > > > > > > > > > > >>>> then HBase will need to bring a new
> block in
> > > > > from
> > > > > > > disk
> > > > > > > > > on
> > > > > > > > > > > > each Get
> > > > > > > > > > > > > > > > >>>> (assuming the Gets are more or less
> random
> > > as
> > > > > far
> > > > > > > as the
> > > > > > > > > > > > server is
> > > > > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > > > > >>>>> In case you'll benefit from reducing
> the
> > > HFile
> > > > > > > block
> > > > > > > > > size
> > > > > > > > > > > > (from 64k
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Lastly, I don't think we tested the performance of using reverse
> > > > > > > > > > > > > > > > >>>>> scan this way; there is probably room to optimize this.
> > > > > > > > > > > > > > > > >>>>> Can you restructure your keys to allow forward scanning? For
> > > > > > > > > > > > > > > > >>>>> example you could store the time as MAX_LONG-time. Or you could
> > > > > > > > > > > > > > > > >>>>> invert all the bits of the time portion of the key, so that it
> > > > > > > > > > > > > > > > >>>>> sorts the other way. Then you could do a forward scan.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>>
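
(A small sketch of the key-inversion idea above, assuming the row key is a prefix followed by an 8-byte timestamp; keyPrefix, timestamp, and t are placeholders for illustration.)

    import org.apache.hadoop.hbase.util.Bytes;

    // Store Long.MAX_VALUE - timestamp so newer entries sort first; a plain
    // forward scan starting at (prefix, MAX_VALUE - t) then returns the entry
    // at or just before time t.
    byte[] rowKey   = Bytes.add(keyPrefix, Bytes.toBytes(Long.MAX_VALUE - timestamp));
    byte[] startRow = Bytes.add(keyPrefix, Bytes.toBytes(Long.MAX_VALUE - t));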
> > > > > > > > > > > > > > > > >>>> and I don't know why, still all these
> scans
> > > > > should
> > > > > > > be
> > > > > > > > > > > > happening in
> > > > > > > > > > > > > > > > >>>> parallel, so the overall call should
> finish
> > > > > fast,
> > > > > > > given
> > > > > > > > > that
> > > > > > > > > > > > no
> > > > > > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > >>>> resource is contended, right? but this
> is
> > > not
> > > > > what's
> > > > > > > > > > > > happening, so I
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > >    - Andy
> > > > > > > >
> > > > > > > > Problems worthy of attack prove their worth by hitting back.
> -
> > > Piet
> > > > > Hein
> > > > > > > > (via Tom White)
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Done. It's HBASE-12270

Please let me know if you have any comments/questions.

----------------------------------------
> Date: Mon, 13 Oct 2014 22:42:37 -0500
> Subject: RE: HBase read performance
> From: busbey@cloudera.com
> To: user@hbase.apache.org
>
> Yes, please file a JIRA and attach the test there.
>
> --
> Sean
> On Oct 13, 2014 10:22 PM, "Khaled Elmeleegy" <kd...@hotmail.com> wrote:
>
>> Actually, it's very easy to reproduce. I have a simple program that
>> reproduces it if run against a simple toy cluster with one RS. I've shared
>> the program with Ted. I can file a jira ticket about it if you want.
>> Khaled
>>
>>> Date: Mon, 13 Oct 2014 19:12:13 -0700
>>> From: larsh@apache.org
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> This looks like something to track.
>>> If there's an easy way to reproduce this, could you file a jira with
>> instructions?
>>> (Although from the conversations here, it looks like it's not trivial to
>> reproduce).
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>> Sent: Monday, October 13, 2014 12:13 PM
>>> Subject: RE: HBase read performance
>>>
>>>
>>> I went through a lengthy testing process with Ted. I tried disabling just
>>> cache-on-write (COW), and also disabling both COW and the bucket cache (BC).
>>> Disabling them makes the problem go away. I think disabling COW in my workload
>>> effectively doesn't exercise the block cache, as my workload is mostly writes,
>>> so disabling COW doesn't cause the cached blocks to churn.
>>> I tried both overcommitting memory and not. This is irrelevant though.
>>> The cache operates on the virtual memory offered to it and it's agnostic to
>>> the physical memory size. Memory overcommitment may only hurt performance
>>> due to swapping by the OS, but this has nothing to do with the bug.
>>> Khaled
>>>
>>>> Date: Sun, 12 Oct 2014 23:17:24 -0700
>>>> Subject: Re: HBase read performance
>>>> From: stack@duboce.net
>>>> To: user@hbase.apache.org
>>>>
>>>> On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com>
>> wrote:
>>>>
>>>>> It goes away, but I think that's because I am basically churning the
>> cache
>>>>> far less. My workload is mostly writes.
>>>>>
>>>>>
>>>> What exactly did you change?
>>>>
>>>> Did you disable both BC and cacheonwrite or just one of these configs?
>>>>
>>>> Disabling BC and/or cacheonwrite would churn your cache more?
>>>>
>>>> What about the heap config? Did you over commit mem? If so, did you
>> try
>>>> doing a direct memory sizing that was w/i the bounds of physical memory
>>>> (well within)? Did you still see exceptions?
>>>>
>>>> St.Ack
>>>>
>>>>
>>>>
>>>>
>>>>>> Date: Sun, 12 Oct 2014 19:43:16 -0700
>>>>>> Subject: Re: HBase read performance
>>>>>> From: yuzhihong@gmail.com
>>>>>> To: user@hbase.apache.org
>>>>>>
>>>>>> Hi,
>>>>>> Can you turn off cacheonwrite and keep bucket cache to see if the
>> problem
>>>>>> goes away ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <
>> kdiaa@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I can reproduce it with some work.
>>>>>>> The workload is basically as follows:
>>>>>>> There are writers streaming writes to a table. Then, there is a reader
>>>>>>> (invoked via a web interface). The reader does 1000 parallel reverse
>>>>>>> scans, all of which end up hitting the same region in my case. The scans
>>>>>>> are effectively "gets" as I just need to get one record off of each of
>>>>>>> them. I just need to do a "reverse" get, which is not supported (would be
>>>>>>> great to have :)), so I do it via reverse scan. After a few tries, the
>>>>>>> reader consistently hits this bug.
>>>>>>>
>>>>>>> This happens with these config changes:
>>>>>>> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
>>>>>>> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
>>>>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
>>>>>>> hbase-site:
>>>>>>> hbase.bucketcache.ioengine=offheap
>>>>>>> hbase.bucketcache.size=4196
>>>>>>> hbase.rs.cacheblocksonwrite=true
>>>>>>> hfile.block.index.cacheonwrite=true
>>>>>>> hfile.block.bloom.cacheonwrite=true
>>>>>>>
>>>>>>> Interestingly, without these config changes, I can't reproduce the
>>>>>>> problem.
>>>>>>> Khaled
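For illustration, the access pattern described above looks roughly like the
sketch below. This is not the poster's attached test program; the table name
"delta" comes from the thread, while the pool size, key list, and class name
are assumptions.

    // Illustrative sketch only (HBase 0.98-era client API).
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;

    public class ScatterGatherSketch {
      public static List<Result> reverseGets(List<byte[]> startKeys) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        final HConnection conn = HConnectionManager.createConnection(conf);
        ExecutorService pool = Executors.newFixedThreadPool(200);
        List<Future<Result>> futures = new ArrayList<Future<Result>>();
        for (final byte[] startKey : startKeys) {            // ~1000 keys in the report
          futures.add(pool.submit(new Callable<Result>() {
            public Result call() throws Exception {
              HTableInterface table = conn.getTable("delta");
              try {
                Scan scan = new Scan(startKey);               // seek point for this key
                scan.setReversed(true);                       // walk backwards in key order
                scan.setCaching(1);                           // only one row is wanted
                ResultScanner scanner = table.getScanner(scan);
                try {
                  return scanner.next();                      // first row at or before startKey
                } finally {
                  scanner.close();
                }
              } finally {
                table.close();
              }
            }
          }));
        }
        List<Result> results = new ArrayList<Result>();
        for (Future<Result> f : futures) {
          results.add(f.get());                               // gather
        }
        pool.shutdown();
        conn.close();
        return results;
      }
    }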
>>>>>>>
>>>>>>>
>>>>>>>> Date: Fri, 10 Oct 2014 10:05:14 -0700
>>>>>>>> Subject: Re: HBase read performance
>>>>>>>> From: stack@duboce.net
>>>>>>>> To: user@hbase.apache.org
>>>>>>>>
>>>>>>>> It looks like we are messing up our positioning math:
>>>>>>>>
>>>>>>> 233     public final Buffer position(int newPosition) {
>>>>>>> 234         if ((newPosition > limit) || (newPosition < 0))
>>>>>>> 235             throw new IllegalArgumentException();
>>>>>>> (java.nio.Buffer, lines 233-235:
>>>>>>> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233)
>>>>>>>>
>>>>>>>>
>>>>>>>> Is it easy to reproduce Khaled? Always same region/store or
>> spread
>>>>>>>> across all reads?
>>>>>>>>
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <
>> kdiaa@hotmail.com
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Andrew thanks. I think this indeed was the problem.
>>>>>>>>> To get over it, I increased the amount of memory given to the region
>>>>>>>>> server to avoid IO on reads. I used the configs below. In my experiments,
>>>>>>>>> I have writers streaming their output to HBase. The reader powers a web
>>>>>>>>> page and does this scatter/gather, where it reads the 1000 keys written
>>>>>>>>> last and passes them to the front end. With this workload, I get the
>>>>>>>>> exception below at the region server. Again, I am using HBase (0.98.6.1).
>>>>>>>>> Any help is appreciated.
>>>>>>>>> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
>>>>>>>>> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
>>>>>>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
>>>>>>>>> hbase-site:
>>>>>>>>> hbase.bucketcache.ioengine=offheap
>>>>>>>>> hfile.block.cache.size=0.4
>>>>>>>>> hbase.bucketcache.size=4196
>>>>>>>>> hbase.rs.cacheblocksonwrite=true
>>>>>>>>> hfile.block.index.cacheonwrite=true
>>>>>>>>> hfile.block.bloom.cacheonwrite=true
>>>>>>>>>
>>>>>>>>> 2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
>>>>>>>>> java.lang.IllegalArgumentException
>>>>>>>>>   at java.nio.Buffer.position(Buffer.java:236)
>>>>>>>>>   at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
>>>>>>>>>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
>>>>>>>>>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
>>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>>>>>>>>>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>> From: apurtell@apache.org
>>>>>>>>>> Date: Tue, 7 Oct 2014 17:09:35 -0700
>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>
>>>>>>>>>>> The cluster has 2 m1.large nodes.
>>>>>>>>>>
>>>>>>>>>> That's the problem right there.
>>>>>>>>>>
>>>>>>>>>> You need to look at c3.4xlarge or i2 instances as a minimum
>>>>>>> requirement.
>>>>>>>>> M1
>>>>>>>>>> and even M3 instance types have ridiculously poor IO.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
>>>>> kdiaa@hotmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Nicolas, Qiang.
>>>>>>>>>>>
>>>>>>>>>>> I was able to write a simple program that reproduces the
>>>>> problem
>>>>>>> on a
>>>>>>>>> tiny
>>>>>>>>>>> HBase cluster on ec2. The cluster has 2 m1.large nodes.
>> One
>>>>> node
>>>>>>> runs
>>>>>>>>> the
>>>>>>>>>>> master, name node and zookeeper. The other node runs a
>> data
>>>>> node
>>>>>>> and a
>>>>>>>>>>> region server, with heap size configured to be 6GB.
>> There, the
>>>>> 1000
>>>>>>>>>>> parallel reverse gets (reverse scans) take 7-8 seconds.
>> The
>>>>> data
>>>>>>> set is
>>>>>>>>>>> tiny (10M records, each having a small number of bytes).
>> As I
>>>>> said
>>>>>>>>> before,
>>>>>>>>>>> all hardware resources are very idle there.
>>>>>>>>>>>
>>>>>>>>>>> Interestingly, running the same workload on my macbook,
>> the
>>>>> 1000
>>>>>>>>> parallel
>>>>>>>>>>> gets take ~200ms on a pseudo-distributed installation.
>>>>>>>>>>>
>>>>>>>>>>> Any help to resolve this mystery is highly appreciated.
>>>>>>>>>>>
>>>>>>>>>>> P.S. please find my test program attached.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Khaled
>>>>>>>>>>>
>>>>>>>>>>>> From: nkeywal@gmail.com
>>>>>>>>>>>> Date: Mon, 6 Oct 2014 09:40:48 +0200
>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't seen it mentioned, but if I understand
>> correctly
>>>>> each
>>>>>>> scan
>>>>>>>>>>>> returns a single row? If so you should use
>> Scan#setSmall to
>>>>> save
>>>>>>>>> some rpc
>>>>>>>>>>>> calls.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Nicolas
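As a rough illustration of the suggestion above (assuming a table handle and a
row key already exist, and that the running version supports combining
setSmall with setReversed), the per-key read then becomes:

    Scan scan = new Scan(rowKey);          // seek point for the "reverse get"
    scan.setReversed(true);
    scan.setSmall(true);                   // open, fetch and close in one RPC
    scan.setCaching(1);
    ResultScanner scanner = table.getScanner(scan);
    Result row = scanner.next();           // the single row that is wanted
    scanner.close();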
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
>>>>> tianq01@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> when using separate HConnection instance, both its
>>>>>>>>>>>>> RpcClient instance(maintain connection to a
>> regionserver)
>>>>> and
>>>>>>>>> Registry
>>>>>>>>>>>>> instance(maintain connection to zookeeper) will be
>>>>> separate..
>>>>>>>>>>>>>
>>>>>>>>>>>>> see
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
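A rough sketch of the separate-connections idea being discussed here
(illustrative only; conf is an existing Configuration, and the pool size of 10
matches the number Khaled reports in the quoted message below):

    // Each HConnection gets its own RpcClient and ZooKeeper registry.
    final int N = 10;
    final HConnection[] connections = new HConnection[N];
    for (int i = 0; i < N; i++) {
      connections[i] = HConnectionManager.createConnection(conf);
    }
    // A reader thread then picks one, e.g. by hashing its thread id:
    HConnection conn = connections[(int) (Thread.currentThread().getId() % N)];
    HTableInterface table = conn.getTable("delta");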
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
>>>>>>>>> kdiaa@hotmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried creating my own HConnections pool to use for my HBase calls, so
>>>>>>>>>>>>>> that not all the (2K) threads share the same HConnection. However, I could
>>>>>>>>>>>>>> only have 10 HConnections. Beyond that I get ZK exceptions, please find it
>>>>>>>>>>>>>> below. Also, with 10 HConnections, I don't see noticeable improvement in
>>>>>>>>>>>>>> performance so far.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>>>>>>>>>>>>>> 2014-10-05 06:11:26,490 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
>>>>>>>>>>>>>> 2014-10-05 06:11:27,845 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>>>>>>>>>>>>>> 2014-10-05 06:11:27,849 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
>>>>>>>>>>>>>> 2014-10-05 06:11:30,405 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>>>>>>>>>>>>>> 2014-10-05 06:11:30,405 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
>>>>>>>>>>>>>> 2014-10-05 06:11:35,278 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>>>>>>>>>>>>>> 2014-10-05 06:11:35,279 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
>>>>>>>>>>>>>> 2014-10-05 06:11:44,393 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>>>>>>>>>>>>>> 2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
>>>>>>>>>>>>>> 2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
>>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>>>>>>>>>>>>>>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>>>>>>>>>>>>>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>>>>>>>>>   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
>>>>>>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>>>>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>>>>>>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
>>>>>>>>>>>>>>   at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
>>>>>>>>>>>>>>> From: kdiaa@hotmail.com
>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>>> Date: Fri, 3 Oct 2014 12:37:39 -0700
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Lars, Ted, and Qiang,
>>>>>>>>>>>>>>> Thanks for all the input.
>>>>>>>>>>>>>>> Qiang: yes all the threads are in the same client
>>>>> process
>>>>>>>>> sharing
>>>>>>>>>>> the
>>>>>>>>>>>>>> same connection. And since I don't see hardware
>>>>> contention,
>>>>>>> may
>>>>>>>>> be
>>>>>>>>>>> there
>>>>>>>>>>>>> is
>>>>>>>>>>>>>> contention over this code path. I'll try using many
>>>>>>> connections
>>>>>>>>> and
>>>>>>>>>>> see
>>>>>>>>>>>>> if
>>>>>>>>>>>>>> it alleviates the problems and I'll report back.
>>>>>>>>>>>>>>> Thanks again,Khaled
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Date: Fri, 3 Oct 2014 15:18:30 +0800
>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>> From: tianq01@gmail.com
>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regarding to profiling, Andrew introduced
>>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>>>>>> months
>>>>>>>>>>>>>>>> ago.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> processCallTime comes from RpcServer#call, so
>> it
>>>>> looks
>>>>>>> good?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a suspect:
>>>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-11306
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> how many processes do you have for your 2000
>> threads?
>>>>>>>>>>>>>>>> if olny 1 process, those threads will share
>> just 1
>>>>>>>>> connection to
>>>>>>>>>>> that
>>>>>>>>>>>>>>>> regionserver, there might be big contention on
>> the
>>>>> RPC
>>>>>>> code
>>>>>>>>> path.
>>>>>>>>>>>>>> ---for
>>>>>>>>>>>>>>>> such case, could you try using different
>> connections?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
>>>>>>> yuzhihong@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Khaled:
>>>>>>>>>>>>>>>>> Do you have profiler such as jprofiler ?
>>>>>>>>>>>>>>>>> Profiling would give us more hint.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Otherwise capturing stack trace during the
>> period
>>>>> of
>>>>>>>>> reverse
>>>>>>>>>>> scan
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>> help.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Oct 2, 2014 at 4:52 PM, lars
>> hofhansl <
>>>>>>>>>>> larsh@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You might have the data in the OS buffer
>> cache,
>>>>>>> without
>>>>>>>>> short
>>>>>>>>>>>>>> circuit
>>>>>>>>>>>>>>>>>> reading the region server has to request
>> the
>>>>> block
>>>>>>> from
>>>>>>>>> the
>>>>>>>>>>> data
>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>> process, which then reads it from the block
>>>>> cache.
>>>>>>>>>>>>>>>>>> That is a few context switches per RPC
>> that do
>>>>> not
>>>>>>> show
>>>>>>>>> up
>>>>>>>>>>> in CPU
>>>>>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>>>>>> In that you also would not see disk IO.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If - as you say - you see a lot of evicted
>>>>> blocks the
>>>>>>>>> data
>>>>>>>>>>> *has*
>>>>>>>>>>>>>> to come
>>>>>>>>>>>>>>>>>> from the OS. If you do not see disk IO is
>> *has*
>>>>> to
>>>>>>> come
>>>>>>>>> from
>>>>>>>>>>> the
>>>>>>>>>>>>> OS
>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>> I.e. there's more RAM on your boxes, and
>> you
>>>>> should
>>>>>>>>> increase
>>>>>>>>>>> the
>>>>>>>>>>>>>> heap
>>>>>>>>>>>>>>>>> block
>>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can measure the context switches with
>> vmstat.
>>>>>>> Other
>>>>>>>>> than
>>>>>>>>>>> that
>>>>>>>>>>>>>> I have
>>>>>>>>>>>>>>>>>> no suggestion until I reproduce the
>> problem.
>>>>>>>>>>>>>>>>>> Also check the data locality index of the
>> region
>>>>>>> server
>>>>>>>>> it
>>>>>>>>>>> should
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> close
>>>>>>>>>>>>>>>>>> to 100%.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- Lars
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>>>>>>>>>>>>>>> To: "user@hbase.apache.org" <
>>>>> user@hbase.apache.org>
>>>>>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 3:24 PM
>>>>>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Lars thanks a lot about all the tips. I'll
>> make
>>>>> sure
>>>>>>> I
>>>>>>>>> cover
>>>>>>>>>>> all
>>>>>>>>>>>>>> of them
>>>>>>>>>>>>>>>>>> and get back to you. I am not sure they
>> are the
>>>>>>>>> bottleneck
>>>>>>>>>>> though
>>>>>>>>>>>>>> as they
>>>>>>>>>>>>>>>>>> all are about optimizing physical resource
>>>>> usage. As
>>>>>>> I
>>>>>>>>> said,
>>>>>>>>>>> I
>>>>>>>>>>>>>> don't see
>>>>>>>>>>>>>>>>>> any contended physical resources now. I'll
>> also
>>>>> try
>>>>>>> to
>>>>>>>>>>> reproduce
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> problem in a simpler environment and pass
>> to you
>>>>> the
>>>>>>> test
>>>>>>>>>>> program
>>>>>>>>>>>>>> to play
>>>>>>>>>>>>>>>>>> with.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Couple of high level points to make. You
>> are
>>>>> right
>>>>>>> that
>>>>>>>>> my
>>>>>>>>>>> use
>>>>>>>>>>>>>> case is
>>>>>>>>>>>>>>>>>> kind of a worst case for HBase reads. But,
>> if
>>>>> things
>>>>>>> go
>>>>>>>>> the
>>>>>>>>>>> way
>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>> described them, there should be tons of
>> disk IO
>>>>> and
>>>>>>> that
>>>>>>>>>>> should
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> clearly
>>>>>>>>>>>>>>>>>> the bottleneck. This is not the case
>> though.
>>>>> That's
>>>>>>> for
>>>>>>>>> the
>>>>>>>>>>>>> simple
>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>>>> that this is done in a test environment (I
>> am
>>>>> still
>>>>>>>>>>> prototyping),
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> not a
>>>>>>>>>>>>>>>>>> lot of data is yet written to HBase.
>> However for
>>>>> the
>>>>>>>>> real use
>>>>>>>>>>>>>> case, there
>>>>>>>>>>>>>>>>>> should writers constantly writing data to
>> HBase
>>>>> and
>>>>>>>>> readers
>>>>>>>>>>>>>> occasionally
>>>>>>>>>>>>>>>>>> doing this scatter/gather. At steady state,
>>>>> things
>>>>>>> should
>>>>>>>>>>> only
>>>>>>>>>>>>> get
>>>>>>>>>>>>>> worse
>>>>>>>>>>>>>>>>>> and all the issues you mentioned should
>> get far
>>>>> more
>>>>>>>>>>> pronounced.
>>>>>>>>>>>>>> At this
>>>>>>>>>>>>>>>>>> point, one can try to mitigate it using
>> more
>>>>> memory
>>>>>>> or
>>>>>>>>> so. I
>>>>>>>>>>> am
>>>>>>>>>>>>>> not there
>>>>>>>>>>>>>>>>>> yet as I think I am hitting some software
>>>>> bottleneck,
>>>>>>>>> which I
>>>>>>>>>>>>>> don't know
>>>>>>>>>>>>>>>>>> how to work around.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 14:20:47 -0700
>>>>>>>>>>>>>>>>>>> From: larsh@apache.org
>>>>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> OK... We might need to investigate this.
>>>>>>>>>>>>>>>>>>> Any chance that you can provide a
>> minimal test
>>>>>>> program
>>>>>>>>> and
>>>>>>>>>>>>>> instruction
>>>>>>>>>>>>>>>>>> about how to set it up.
>>>>>>>>>>>>>>>>>>> We can do some profiling then.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One thing to note is that with scanning
>> HBase
>>>>>>> cannot
>>>>>>>>> use
>>>>>>>>>>> bloom
>>>>>>>>>>>>>> filters
>>>>>>>>>>>>>>>>>> to rule out HFiles ahead of time, it needs
>> to
>>>>> look
>>>>>>> into
>>>>>>>>> all
>>>>>>>>>>> of
>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>>>> So you kind of hit on the absolute worst
>> case:
>>>>>>>>>>>>>>>>>>> - random reads that do not fit into the
>> block
>>>>> cache
>>>>>>>>>>>>>>>>>>> - cannot use bloom filters
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Few more question/comments:
>>>>>>>>>>>>>>>>>>> - Do you have short circuit reading
>> enabled? If
>>>>>>> not,
>>>>>>>>> you
>>>>>>>>>>>>> should.
>>>>>>>>>>>>>>>>>>> - Is your table major compacted? That
>> will
>>>>> reduce
>>>>>>> the
>>>>>>>>>>> number of
>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> look at.
>>>>>>>>>>>>>>>>>>> - Did you disable Nagle's everywhere
>> (enabled
>>>>>>>>> tcpnodelay)?
>>>>>>>>>>> It
>>>>>>>>>>>>>> disabled
>>>>>>>>>>>>>>>>>> by default in HBase, but necessarily in
>> your
>>>>> install
>>>>>>> of
>>>>>>>>> HDFS.
>>>>>>>>>>>>>>>>>>> - Which version of HDFS are you using as
>>>>> backing
>>>>>>>>>>> filesystem?
>>>>>>>>>>>>>>>>>>> - If your disk is idle, it means the
>> data fits
>>>>> into
>>>>>>>>> the OS
>>>>>>>>>>>>> buffer
>>>>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>>>>>> In turn that means that you increase the
>> heap
>>>>> for the
>>>>>>>>> region
>>>>>>>>>>>>>> servers. You
>>>>>>>>>>>>>>>>>> can also use block encoding (FAST_DIFF) to
>> try to
>>>>>>> make
>>>>>>>>> sure
>>>>>>>>>>> the
>>>>>>>>>>>>>> entire
>>>>>>>>>>>>>>>>>> working set fits into the cache.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Also try to reduce the block size -
>> although
>>>>> if
>>>>>>> your
>>>>>>>>>>> overall
>>>>>>>>>>>>>> working
>>>>>>>>>>>>>>>>>> set does not fit in the heap it won't help
>> much.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is a good section of the book to
>> read
>>>>> through
>>>>>>>>>>> generally
>>>>>>>>>>>>>> (even
>>>>>>>>>>>>>>>>>> though you might know most of this
>> already):
>>>>>>>>>>>>>>>>>>
>>>>>>> http://hbase.apache.org/book.html#perf.configurations
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -- Lars
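For reference on the short-circuit-read question above: on HDFS 2.x the usual
settings are the two standard properties below (the socket path is only an
example), placed in hdfs-site.xml and visible to both the DataNode and the
RegionServer:

    dfs.client.read.shortcircuit=true
    dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket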
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>>>>>>>> From: Khaled Elmeleegy <
>> kdiaa@hotmail.com>
>>>>>>>>>>>>>>>>>>> To: "user@hbase.apache.org" <
>>>>> user@hbase.apache.org
>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cc:
>>>>>>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 11:27 AM
>>>>>>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I do see a very brief spike in CPU
>>>>> (user/system),
>>>>>>> but
>>>>>>>>> it's
>>>>>>>>>>> no
>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>> near
>>>>>>>>>>>>>>>>>> 0% idle. It goes from 99% idle down to
>> something
>>>>>>> like 40%
>>>>>>>>>>> idle
>>>>>>>>>>>>> for
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> second
>>>>>>>>>>>>>>>>>> or so. The thing to note, this is all on a
>> test
>>>>>>> cluster,
>>>>>>>>> so
>>>>>>>>>>> no
>>>>>>>>>>>>>> real load.
>>>>>>>>>>>>>>>>>> Things are generally idle until i issue
>> 2-3 of
>>>>> these
>>>>>>>>>>>>>> multi-scan-requests
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> render a web page. Then, you see the spike
>> in the
>>>>>>> cpu and
>>>>>>>>>>> some
>>>>>>>>>>>>>> activity
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> the network and disk, but nowhere near
>>>>> saturation.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If there are specific tests you'd like
>> me to
>>>>> do to
>>>>>>>>> debug
>>>>>>>>>>> this,
>>>>>>>>>>>>>> I'd be
>>>>>>>>>>>>>>>>>> more than happy to do it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 11:15:59 -0700
>>>>>>>>>>>>>>>>>>>> From: larsh@apache.org
>>>>>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I still think you're waiting on disk. No
>>>>> IOWAIT?
>>>>>>> So
>>>>>>>>> CPU
>>>>>>>>>>> is not
>>>>>>>>>>>>>> waiting
>>>>>>>>>>>>>>>>>> a lot for IO. No high User/System CPU
>> either?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If you see a lot of evicted block then
>> each
>>>>> RPC
>>>>>>> has a
>>>>>>>>> high
>>>>>>>>>>>>>> chance of
>>>>>>>>>>>>>>>>>> requiring to bring an entire 64k block in.
>> You'll
>>>>>>> see bad
>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We might need to trace this specific
>> scenario.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -- Lars
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>>>> From: Khaled Elmeleegy <
>> kdiaa@hotmail.com>
>>>>>>>>>>>>>>>>>>>> To: "user@hbase.apache.org" <
>>>>>>> user@hbase.apache.org>
>>>>>>>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 10:46 AM
>>>>>>>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've set the heap size to 6GB and I do
>> have gc
>>>>>>>>> logging. No
>>>>>>>>>>>>> long
>>>>>>>>>>>>>> pauses
>>>>>>>>>>>>>>>>>> there -- occasional 0.1s or 0.2s.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Other than the discrepancy between
>> what's
>>>>>>> reported on
>>>>>>>>> the
>>>>>>>>>>>>>> client and
>>>>>>>>>>>>>>>>>> what's reported at the RS, there is also
>> the
>>>>> issue
>>>>>>> of not
>>>>>>>>>>> getting
>>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>>> concurrency. So, even if a reverse get
>> takes
>>>>> 100ms
>>>>>>> or so
>>>>>>>>>>> (this
>>>>>>>>>>>>> has
>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>>>>>> mostly blocking on various things as no
>> physical
>>>>>>>>> resource is
>>>>>>>>>>>>>> contended),
>>>>>>>>>>>>>>>>>> then the other gets/scans should be able to
>>>>> proceed
>>>>>>> in
>>>>>>>>>>> parallel,
>>>>>>>>>>>>>> so a
>>>>>>>>>>>>>>>>>> thousand concurrent gets/scans should
>> finish in
>>>>> few
>>>>>>>>> hundreds
>>>>>>>>>>> of
>>>>>>>>>>>>> ms
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>> seconds. That's why I thought I'd increase
>> the
>>>>>>> handlers
>>>>>>>>>>> count to
>>>>>>>>>>>>>> try to
>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>> more concurrency, but it didn't help. So,
>> there
>>>>> must
>>>>>>> be
>>>>>>>>>>> something
>>>>>>>>>>>>>> else.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>>>>>>>> From: ndimiduk@gmail.com
>>>>>>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>>>>>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Do check again on the heap size of the
>> region
>>>>>>>>> servers.
>>>>>>>>>>> The
>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>>>>> unconfigured size is 1G; too small for
>> much
>>>>> of
>>>>>>>>> anything.
>>>>>>>>>>>>> Check
>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>>>> RS
>>>>>>>>>>>>>>>>>> logs
>>>>>>>>>>>>>>>>>>>>> -- look for lines produced by the
>>>>> JVMPauseMonitor
>>>>>>>>> thread.
>>>>>>>>>>>>> They
>>>>>>>>>>>>>>>>> usually
>>>>>>>>>>>>>>>>>>>>> correlate with long GC pauses or other
>>>>>>> process-freeze
>>>>>>>>>>> events.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Get is implemented as a Scan of a
>> single
>>>>> row, so
>>>>>>> a
>>>>>>>>>>> reverse
>>>>>>>>>>>>>> scan of a
>>>>>>>>>>>>>>>>>> single
>>>>>>>>>>>>>>>>>>>>> row should be functionally equivalent.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In practice, I have seen discrepancy
>> between
>>>>> the
>>>>>>>>>>> latencies
>>>>>>>>>>>>>> reported
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> RS and the latencies experienced by the
>>>>> client.
>>>>>>> I've
>>>>>>>>> not
>>>>>>>>>>>>>> investigated
>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> area thoroughly.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
>>>>>>> Elmeleegy <
>>>>>>>>>>>>>> kdiaa@hotmail.com
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks Lars for your quick reply.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes performance is similar with less
>>>>> handlers (I
>>>>>>>>> tried
>>>>>>>>>>> with
>>>>>>>>>>>>>> 100
>>>>>>>>>>>>>>>>>> first).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The payload is not big ~1KB or so. The
>>>>> working
>>>>>>> set
>>>>>>>>>>> doesn't
>>>>>>>>>>>>>> seem to
>>>>>>>>>>>>>>>>>> fit in
>>>>>>>>>>>>>>>>>>>>>> memory as there are many cache misses.
>>>>> However,
>>>>>>>>> disk is
>>>>>>>>>>> far
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> being a
>>>>>>>>>>>>>>>>>>>>>> bottleneck. I checked using iostat. I
>> also
>>>>>>> verified
>>>>>>>>> that
>>>>>>>>>>>>>> neither the
>>>>>>>>>>>>>>>>>>>>>> network nor the CPU of the region
>> server or
>>>>> the
>>>>>>>>> client
>>>>>>>>>>> are a
>>>>>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>>>>>> This leads me to believe that likely
>> this
>>>>> is a
>>>>>>>>> software
>>>>>>>>>>>>>> bottleneck,
>>>>>>>>>>>>>>>>>>>>>> possibly due to a misconfiguration on
>> my
>>>>> side. I
>>>>>>>>> just
>>>>>>>>>>> don't
>>>>>>>>>>>>>> know how
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> debug it. A clear disconnect I see is
>> the
>>>>>>> individual
>>>>>>>>>>> request
>>>>>>>>>>>>>> latency
>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>> reported by metrics on the region
>> server
>>>>> (IPC
>>>>>>>>>>>>> processCallTime
>>>>>>>>>>>>>> vs
>>>>>>>>>>>>>>>>>> scanNext)
>>>>>>>>>>>>>>>>>>>>>> vs what's measured on the client.
>> Does this
>>>>>>> sound
>>>>>>>>>>> right? Any
>>>>>>>>>>>>>> ideas
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>> how
>>>>>>>>>>>>>>>>>>>>>> to better debug it?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> About this trick with the timestamps
>> to be
>>>>> able
>>>>>>> to
>>>>>>>>> do a
>>>>>>>>>>>>>> forward
>>>>>>>>>>>>>>>>> scan,
>>>>>>>>>>>>>>>>>>>>>> thanks for pointing it out. Actually,
>> I am
>>>>>>> aware of
>>>>>>>>> it.
>>>>>>>>>>> The
>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>> is, sometimes I want to get the key
>> after a
>>>>>>>>> particular
>>>>>>>>>>>>>> timestamp and
>>>>>>>>>>>>>>>>>>>>>> sometimes I want to get the key
>> before, so
>>>>> just
>>>>>>>>> relying
>>>>>>>>>>> on
>>>>>>>>>>>>>> the key
>>>>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>>>>>> doesn't work. Ideally, I want a
>> reverse
>>>>> get(). I
>>>>>>>>> thought
>>>>>>>>>>>>>> reverse
>>>>>>>>>>>>>>>>> scan
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>> do the trick though.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>> ----------------------------------------
>>>>>>>>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>>>>>>>>>>>>>>>>>>>> From: larsh@apache.org
>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Khaled,
>>>>>>>>>>>>>>>>>>>>>>> is it the same with fewer threads?
>> 1500
>>>>> handler
>>>>>>>>> threads
>>>>>>>>>>>>>> seems to
>>>>>>>>>>>>>>>>> be a
>>>>>>>>>>>>>>>>>>>>>> lot. Typically a good number of
>> threads
>>>>> depends
>>>>>>> on
>>>>>>>>> the
>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>>>>> (number of
>>>>>>>>>>>>>>>>>>>>>> cores, number of spindles, etc). I
>> cannot
>>>>> think
>>>>>>> of
>>>>>>>>> any
>>>>>>>>>>> type
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> scenario
>>>>>>>>>>>>>>>>>>>>>> where more than 100 would give any
>>>>> improvement.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> How large is the payload per KV
>> retrieved
>>>>> that
>>>>>>>>> way? If
>>>>>>>>>>>>> large
>>>>>>>>>>>>>> (as
>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>>>>> few 100k) you definitely want to
>> lower the
>>>>>>> number
>>>>>>>>> of the
>>>>>>>>>>>>>> handler
>>>>>>>>>>>>>>>>>> threads.
>>>>>>>>>>>>>>>>>>>>>>> How much heap do you give the region
>>>>> server?
>>>>>>> Does
>>>>>>>>> the
>>>>>>>>>>>>>> working set
>>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>>>>> into the cache? (i.e. in the metrics,
>> do you
>>>>>>> see the
>>>>>>>>>>>>> eviction
>>>>>>>>>>>>>> count
>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>>>>> up, if so it does not fit into the
>> cache).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If the working set does not fit into
>> the
>>>>> cache
>>>>>>>>>>> (eviction
>>>>>>>>>>>>>> count goes
>>>>>>>>>>>>>>>>>> up)
>>>>>>>>>>>>>>>>>>>>>> then HBase will need to bring a new
>> block in
>>>>>>> from
>>>>>>>>> disk
>>>>>>>>>>> on
>>>>>>>>>>>>>> each Get
>>>>>>>>>>>>>>>>>>>>>> (assuming the Gets are more or less
>> random
>>>>> as
>>>>>>> far
>>>>>>>>> as the
>>>>>>>>>>>>>> server is
>>>>>>>>>>>>>>>>>>>>>> concerned).
>>>>>>>>>>>>>>>>>>>>>>> In case you'll benefit from reducing
>> the
>>>>> HFile
>>>>>>>>> block
>>>>>>>>>>> size
>>>>>>>>>>>>>> (from 64k
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> 8k or even 4k).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Lastly I don't think we tested the
>>>>> performance
>>>>>>> of
>>>>>>>>> using
>>>>>>>>>>>>>> reverse
>>>>>>>>>>>>>>>>> scan
>>>>>>>>>>>>>>>>>>>>>> this way, there is probably room to
>> optimize
>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>>> Can you restructure your keys to
>> allow
>>>>> forwards
>>>>>>>>>>> scanning?
>>>>>>>>>>>>> For
>>>>>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>>>>>> you could store the time as
>> MAX_LONG-time.
>>>>> Or
>>>>>>> you
>>>>>>>>> could
>>>>>>>>>>>>>> invert all
>>>>>>>>>>>>>>>>>> the bits
>>>>>>>>>>>>>>>>>>>>>> of the time portion of the key, so
>> that it
>>>>> sort
>>>>>>> the
>>>>>>>>>>> other
>>>>>>>>>>>>>> way. Then
>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>> could do a forward scan.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Let us know how it goes.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -- Lars
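A small sketch of the key layout Lars describes, assuming a composite row key
of <entity><timestamp>; the names here are illustrative:

    // Store the time portion as Long.MAX_VALUE - timestamp so newest sorts first.
    long reversedTs = Long.MAX_VALUE - timestamp;
    byte[] rowKey = Bytes.add(entityBytes, Bytes.toBytes(reversedTs));
    // To read the latest entry at or before queryTs, start a forward scan at
    // Bytes.add(entityBytes, Bytes.toBytes(Long.MAX_VALUE - queryTs)) and take
    // the first row returned.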
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>

RE: HBase read performance

Posted by Sean Busbey <bu...@cloudera.com>.
Yes, please file a JIRA and attach the test there.

-- 
Sean
On Oct 13, 2014 10:22 PM, "Khaled Elmeleegy" <kd...@hotmail.com> wrote:

> Actually, it's very easy to reproduce. I have a simple program that
> reproduces it if run against a simple toy cluster with one RS. I've shared
> the program with Ted. I can file a jira ticket about it if you want.
> Khaled
>
> > Date: Mon, 13 Oct 2014 19:12:13 -0700
> > From: larsh@apache.org
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > This looks like something to track.
> > If there's an easy way to reproduce this, could you file a jira with
> instructions?
> > (Although from the conversations here, it looks like it's not trivial to
> reproduce).
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Monday, October 13, 2014 12:13 PM
> > Subject: RE: HBase read performance
> >
> >
> > I went through a lengthy testing process with Ted. I tried disabling just
> > cache-on-write (COW), and also disabling both COW and the bucket cache (BC).
> > Disabling them makes the problem go away. I think disabling COW in my workload
> > effectively doesn't exercise the block cache, as my workload is mostly writes,
> > so disabling COW doesn't cause the cached blocks to churn.
> > I tried both overcommitting memory and not. This is irrelevant though.
> The cache operates on the virtual memory offered to it and it's agnostic to
> the physical memory size. Memory over commitment may only hurt performance
> due to swapping by the OS, but this has nothing to do with the bug.
> > Khaled
> >
> > > Date: Sun, 12 Oct 2014 23:17:24 -0700
> > > Subject: Re: HBase read performance
> > > From: stack@duboce.net
> > > To: user@hbase.apache.org
> > >
> > > On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> > >
> > > > It goes away, but I think that's because I am basically churning the
> cache
> > > > far less. My workload is mostly writes.
> > > >
> > > >
> > > What exactly did you change?
> > >
> > > Did you disable both BC and cacheonwrite or just one of these configs?
> > >
> > > Disabling BC and/or cacheonwrite would churn your cache more?
> > >
> > > What about the heap config?  Did you over commit mem?  If so, did you
> try
> > > doing a direct memory sizing that was w/i the bounds of physical memory
> > > (well within)?  Did you still see exceptions?
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > > > > Date: Sun, 12 Oct 2014 19:43:16 -0700
> > > > > Subject: Re: HBase read performance
> > > > > From: yuzhihong@gmail.com
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > Hi,
> > > > > Can you turn off cacheonwrite and keep bucket cache to see if the
> problem
> > > > > goes away ?
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > > wrote:
> > > > >
> > > > > > Yes, I can reproduce it with some work.
> > > > > > The workload is basically as follows:
> > > > > > There are writers streaming writes to a table. Then, there is a
> reader
> > > > > > (invoked via a web interface). The reader does a 1000 parallel
> reverse
> > > > > > scans, all end up hitting the same region in my case. The scans
> are
> > > > > > effectively "gets" as I just need to get one record off of each
> of
> > > > them. I
> > > > > > just need to do a "reverse" get, which is not supported (would be
> > > > great to
> > > > > > have :)), so I do it via reverse scan. After few tries, the
> reader
> > > > > > consistently hits this bug.
> > > > > >
> > > > > > This happens with these config changes:
> > > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > > hbase-site: hbase.bucketcache.ioengine=offheap
> > > > > >             hbase.bucketcache.size=4196
> > > > > >             hbase.rs.cacheblocksonwrite=true
> > > > > >             hfile.block.index.cacheonwrite=true
> > > > > >             hfile.block.bloom.cacheonwrite=true
> > > > > >
> > > > > > Interestingly, without these config changes, I can't reproduce
> the
> > > > problem.
> > > > > > Khaled
> > > > > >
> > > > > >
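For readers following the thread: the access pattern described above, one row per key fetched with a reverse scan, looks roughly like the sketch below against the 0.98 client API. The class, method, and variable names are illustrative; Khaled's actual test program is not included in the archive. Nicolas's Scan#setSmall suggestion further down the thread can cut the remaining open/next/close round trips to a single RPC, though whether small scans combine with setReversed on 0.98.6 is worth verifying.

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ReverseGetSketch {
  // Returns the newest row at or before startRow, i.e. a "reverse get".
  static Result reverseGet(HTableInterface table, byte[] startRow) throws IOException {
    Scan scan = new Scan(startRow);   // begin at the probe key
    scan.setReversed(true);           // and walk backwards from it
    scan.setCaching(1);               // only one row is wanted per scan
    ResultScanner scanner = table.getScanner(scan);
    try {
      return scanner.next();          // first row <= startRow, or null if none
    } finally {
      scanner.close();
    }
  }
}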
> > > > > > > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > > > > > > Subject: Re: HBase read performance
> > > > > > > From: stack@duboce.net
> > > > > > > To: user@hbase.apache.org
> > > > > > >
> > > > > > > > It looks like we are messing up our positioning math:
> > > > > > > >
> > > > > > > >     public final Buffer position(int newPosition) {
> > > > > > > >         if ((newPosition > limit) || (newPosition < 0))
> > > > > > > >             throw new IllegalArgumentException();
> > > > > > > >
> > > > > > > > (java.nio.Buffer, lines 233-235; see
> > > > > > > > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233
> > > > > > > > )
> > > > > > > >
> > > > > > > > Is it easy to reproduce Khaled?  Always same region/store or spread
> > > > > > > > across all reads?
> > > > > > > >
> > > > > > > > St.Ack
> > > > > > >
> > > > > > >
> > > > > > >
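The quoted JDK snippet is easy to reproduce outside HBase. The sketch below (not HBase code) shows the same IllegalArgumentException that ByteBufferUtils.skip() triggered when it asked a ByteBuffer to move its position past the buffer's limit.

import java.nio.ByteBuffer;

public class PositionPastLimit {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.limit(8);        // limit is now 8
    buf.position(12);    // 12 > limit -> java.lang.IllegalArgumentException, as in Buffer.java:236
  }
}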
> > > > > > > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <
> kdiaa@hotmail.com
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Andrew thanks. I think this indeed was the problem.
> > > > > > > > To get over it, I increased the amount of memory given to the region
> > > > > > > > server to avoid IO on reads. I used the configs below. In my experiments,
> > > > > > > > I have writers streaming their output to HBase. The reader powers a web
> > > > > > > > page and does this scatter/gather, where it reads 1000 keys written last
> > > > > > > > and passes them to the front end. With this workload, I get the exception
> > > > > > > > below at the region server. Again, I am using HBase (0.98.6.1). Any help
> > > > > > > > is appreciated.
> > > > > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > > > > hbase-site: hbase.bucketcache.ioengine=offheap
> > > > > > > >             hfile.block.cache.size=0.4
> > > > > > > >             hbase.bucketcache.size=4196
> > > > > > > >             hbase.rs.cacheblocksonwrite=true
> > > > > > > >             hfile.block.index.cacheonwrite=true
> > > > > > > >             hfile.block.bloom.cacheonwrite=true
> > > > > > > >
> > > > > > > > 2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> > > > > > > > java.lang.IllegalArgumentException
> > > > > > > >         at java.nio.Buffer.position(Buffer.java:236)
> > > > > > > >         at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > > > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > > > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > > > > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > > > > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > > > > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > > > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > > > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > > > > > > >         at java.lang.Thread.run(Thread.java:744)
> > > > > > > > > From: apurtell@apache.org
> > > > > > > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > >
> > > > > > > > > > The cluster has 2 m1.large nodes.
> > > > > > > > >
> > > > > > > > > That's the problem right there.
> > > > > > > > >
> > > > > > > > > You need to look at c3.4xlarge or i2 instances as a minimum
> > > > > > requirement.
> > > > > > > > M1
> > > > > > > > > and even M3 instance types have ridiculously poor IO.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
> > > > kdiaa@hotmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Nicolas, Qiang.
> > > > > > > > > >
> > > > > > > > > > I was able to write a simple program that reproduces the
> > > > problem
> > > > > > on a
> > > > > > > > tiny
> > > > > > > > > > HBase cluster on ec2. The cluster has 2 m1.large nodes.
> One
> > > > node
> > > > > > runs
> > > > > > > > the
> > > > > > > > > > master, name node and zookeeper. The other node runs a
> data
> > > > node
> > > > > > and a
> > > > > > > > > > region server, with heap size configured to be 6GB.
> There, the
> > > > 1000
> > > > > > > > > > parallel reverse gets (reverse scans) take 7-8 seconds.
> The
> > > > data
> > > > > > set is
> > > > > > > > > > tiny (10M records, each having a small number of bytes).
> As I
> > > > said
> > > > > > > > before,
> > > > > > > > > > all hardware resources are very idle there.
> > > > > > > > > >
> > > > > > > > > > Interestingly, running the same workload on my macbook,
> the
> > > > 1000
> > > > > > > > parallel
> > > > > > > > > > gets take ~200ms on a pseudo-distributed installation.
> > > > > > > > > >
> > > > > > > > > > Any help to resolve this mystery is highly appreciated.
> > > > > > > > > >
> > > > > > > > > > P.S. please find my test program attached.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Khaled
> > > > > > > > > >
> > > > > > > > > > > From: nkeywal@gmail.com
> > > > > > > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I haven't seen it mentioned, but if I understand
> correctly
> > > > each
> > > > > > scan
> > > > > > > > > > > returns a single row? If so you should use
> Scan#setSmall to
> > > > save
> > > > > > > > some rpc
> > > > > > > > > > > calls.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > >
> > > > > > > > > > > Nicolas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
> > > > tianq01@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > when using separate HConnection instance, both its
> > > > > > > > > > > > RpcClient instance(maintain connection to a
> regionserver)
> > > > and
> > > > > > > > Registry
> > > > > > > > > > > > instance(maintain connection to zookeeper) will be
> > > > separate..
> > > > > > > > > > > >
> > > > > > > > > > > > see
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
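A small pool along the lines Qiang suggests might look like the sketch below (0.98 client API; the pool size handling and class names are illustrative, not Khaled's code). Note that every HConnection keeps its own ZooKeeper session, so the quorum's maxClientCnxns limit (hbase.zookeeper.property.maxClientCnxns for an HBase-managed quorum, or maxClientCnxns in zoo.cfg) has to be at least the pool size per client host; that limit may be what the ConnectionLoss errors further down the thread are about.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class ConnectionPoolSketch {
  private final HConnection[] connections;
  private final AtomicLong next = new AtomicLong();

  public ConnectionPoolSketch(Configuration conf, int size) throws IOException {
    connections = new HConnection[size];
    for (int i = 0; i < size; i++) {
      connections[i] = HConnectionManager.createConnection(conf);
    }
  }

  // Hand out tables round-robin; callers should close the HTableInterface after use.
  public HTableInterface getTable(String tableName) throws IOException {
    int i = (int) (next.getAndIncrement() % connections.length);
    return connections[i].getTable(tableName);
  }

  public void close() throws IOException {
    for (HConnection c : connections) {
      c.close();
    }
  }
}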
> > > > > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > > > > > > kdiaa@hotmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I tried creating my own HConnections pool to use
> for my
> > > > HBase
> > > > > > > > calls,
> > > > > > > > > > so
> > > > > > > > > > > > > that not all the (2K) threads share the same
> HConnection.
> > > > > > > > However, I
> > > > > > > > > > > > could
> > > > > > > > > > > > > only have 10 HConnections. Beyond that I get ZK
> > > > exceptions,
> > > > > > > > please
> > > > > > > > > > find
> > > > > > > > > > > > it
> > > > > > > > > > > > > below. Also, with 10 HConnections, I don't see
> noticeable
> > > > > > > > > > improvement in
> > > > > > > > > > > > > performance so far.
> > > > > > > > > > > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > > 2014-10-05 06:11:26,490 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
> > > > > > > > > > > > > 2014-10-05 06:11:27,845 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > > 2014-10-05 06:11:27,849 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
> > > > > > > > > > > > > 2014-10-05 06:11:30,405 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > > 2014-10-05 06:11:30,405 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
> > > > > > > > > > > > > 2014-10-05 06:11:35,278 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > > 2014-10-05 06:11:35,279 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
> > > > > > > > > > > > > 2014-10-05 06:11:44,393 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > > 2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
> > > > > > > > > > > > > 2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> > > > > > > > > > > > > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > > > > > >         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > > > > > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > > > > > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > > > > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > > > > > > Thanks for all the input.
> > > > > > > > > > > > > > Qiang: yes all the threads are in the same client
> > > > process
> > > > > > > > sharing
> > > > > > > > > > the
> > > > > > > > > > > > > same connection. And since I don't see hardware
> > > > contention,
> > > > > > may
> > > > > > > > be
> > > > > > > > > > there
> > > > > > > > > > > > is
> > > > > > > > > > > > > contention over this code path. I'll try using many
> > > > > > connections
> > > > > > > > and
> > > > > > > > > > see
> > > > > > > > > > > > if
> > > > > > > > > > > > > it alleviates the problems and I'll report back.
> > > > > > > > > > > > > > Thanks again,Khaled
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > > > > > > >
> > > > > > > > > >
> > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > > > > > > months
> > > > > > > > > > > > > > > ago.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > processCallTime comes from RpcServer#call, so
> it
> > > > looks
> > > > > > good?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have a suspect:
> > > > > > > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > how many processes do you have for your 2000
> threads?
> > > > > > > > > > > > > > > if olny 1 process, those threads will share
> just 1
> > > > > > > > connection to
> > > > > > > > > > that
> > > > > > > > > > > > > > > regionserver, there might be big contention on
> the
> > > > RPC
> > > > > > code
> > > > > > > > path.
> > > > > > > > > > > > > ---for
> > > > > > > > > > > > > > > such case, could you try using different
> connections?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> > > > > > yuzhihong@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Khaled:
> > > > > > > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Otherwise capturing stack trace during the
> period
> > > > of
> > > > > > > > reverse
> > > > > > > > > > scan
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > help.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars
> hofhansl <
> > > > > > > > > > larsh@apache.org>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > You might have the data in the OS buffer
> cache,
> > > > > > without
> > > > > > > > short
> > > > > > > > > > > > > circuit
> > > > > > > > > > > > > > > > > reading the region server has to request
> the
> > > > block
> > > > > > from
> > > > > > > > the
> > > > > > > > > > data
> > > > > > > > > > > > > node
> > > > > > > > > > > > > > > > > process, which then reads it from the block
> > > > cache.
> > > > > > > > > > > > > > > > > That is a few context switches per RPC
> that do
> > > > not
> > > > > > show
> > > > > > > > up
> > > > > > > > > > in CPU
> > > > > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If - as you say - you see a lot of evicted
> > > > blocks the
> > > > > > > > data
> > > > > > > > > > *has*
> > > > > > > > > > > > > to come
> > > > > > > > > > > > > > > > > from the OS. If you do not see disk IO is
> *has*
> > > > to
> > > > > > come
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > > OS
> > > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > > > I.e. there's more RAM on your boxes, and
> you
> > > > should
> > > > > > > > increase
> > > > > > > > > > the
> > > > > > > > > > > > > heap
> > > > > > > > > > > > > > > > block
> > > > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > You can measure the context switches with
> vmstat.
> > > > > > Other
> > > > > > > > than
> > > > > > > > > > that
> > > > > > > > > > > > > I have
> > > > > > > > > > > > > > > > > no suggestion until I reproduce the
> problem.
> > > > > > > > > > > > > > > > > Also check the data locality index of the
> region
> > > > > > server
> > > > > > > > it
> > > > > > > > > > should
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > close
> > > > > > > > > > > > > > > > > to 100%.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > > > user@hbase.apache.org>
> > > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll
> make
> > > > sure
> > > > > > I
> > > > > > > > cover
> > > > > > > > > > all
> > > > > > > > > > > > > of them
> > > > > > > > > > > > > > > > > and get back to you. I am not sure they
> are the
> > > > > > > > bottleneck
> > > > > > > > > > though
> > > > > > > > > > > > > as they
> > > > > > > > > > > > > > > > > all are about optimizing physical resource
> > > > usage. As
> > > > > > I
> > > > > > > > said,
> > > > > > > > > > I
> > > > > > > > > > > > > don't see
> > > > > > > > > > > > > > > > > any contended physical resources now. I'll
> also
> > > > try
> > > > > > to
> > > > > > > > > > reproduce
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > problem in a simpler environment and pass
> to you
> > > > the
> > > > > > test
> > > > > > > > > > program
> > > > > > > > > > > > > to play
> > > > > > > > > > > > > > > > > with.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Couple of high level points to make. You
> are
> > > > right
> > > > > > that
> > > > > > > > my
> > > > > > > > > > use
> > > > > > > > > > > > > case is
> > > > > > > > > > > > > > > > > kind of a worst case for HBase reads. But,
> if
> > > > things
> > > > > > go
> > > > > > > > the
> > > > > > > > > > way
> > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > described them, there should be tons of
> disk IO
> > > > and
> > > > > > that
> > > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > clearly
> > > > > > > > > > > > > > > > > the bottleneck. This is not the case
> though.
> > > > That's
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > > simple
> > > > > > > > > > > > > reason
> > > > > > > > > > > > > > > > > that this is done in a test environment (I
> am
> > > > still
> > > > > > > > > > prototyping),
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > not a
> > > > > > > > > > > > > > > > > lot of data is yet written to HBase.
> However for
> > > > the
> > > > > > > > real use
> > > > > > > > > > > > > case, there
> > > > > > > > > > > > > > > > > > should be writers constantly writing data to
> HBase
> > > > and
> > > > > > > > readers
> > > > > > > > > > > > > occasionally
> > > > > > > > > > > > > > > > > doing this scatter/gather. At steady state,
> > > > things
> > > > > > should
> > > > > > > > > > only
> > > > > > > > > > > > get
> > > > > > > > > > > > > worse
> > > > > > > > > > > > > > > > > and all the issues you mentioned should
> get far
> > > > more
> > > > > > > > > > pronounced.
> > > > > > > > > > > > > At this
> > > > > > > > > > > > > > > > > point, one can try to mitigate it using
> more
> > > > memory
> > > > > > or
> > > > > > > > so. I
> > > > > > > > > > am
> > > > > > > > > > > > > not there
> > > > > > > > > > > > > > > > > yet as I think I am hitting some software
> > > > bottleneck,
> > > > > > > > which I
> > > > > > > > > > > > > don't know
> > > > > > > > > > > > > > > > > how to work around.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > > > > > > Any chance that you can provide a
> minimal test
> > > > > > program
> > > > > > > > and
> > > > > > > > > > > > > instruction
> > > > > > > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > One thing to note is that with scanning
> HBase
> > > > > > cannot
> > > > > > > > use
> > > > > > > > > > bloom
> > > > > > > > > > > > > filters
> > > > > > > > > > > > > > > > > to rule out HFiles ahead of time, it needs
> to
> > > > look
> > > > > > into
> > > > > > > > all
> > > > > > > > > > of
> > > > > > > > > > > > > them.
> > > > > > > > > > > > > > > > > > So you kind of hit on the absolute worst
> case:
> > > > > > > > > > > > > > > > > > - random reads that do not fit into the
> block
> > > > cache
> > > > > > > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > > > > > > > - Do you have short circuit reading
> enabled? If
> > > > > > not,
> > > > > > > > you
> > > > > > > > > > > > should.
> > > > > > > > > > > > > > > > > > - Is your table major compacted? That
> will
> > > > reduce
> > > > > > the
> > > > > > > > > > number of
> > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > look at.
> > > > > > > > > > > > > > > > > > - Did you disable Nagle's everywhere
> (enabled
> > > > > > > > tcpnodelay)?
> > > > > > > > > > It
> > > > > > > > > > > > > disabled
> > > > > > > > > > > > > > > > > by default in HBase, but necessarily in
> your
> > > > install
> > > > > > of
> > > > > > > > HDFS.
> > > > > > > > > > > > > > > > > > - Which version of HDFS are you using as
> > > > backing
> > > > > > > > > > filesystem?
> > > > > > > > > > > > > > > > > > - If your disk is idle, it means the
> data fits
> > > > into
> > > > > > > > the OS
> > > > > > > > > > > > buffer
> > > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > > > In turn that means that you increase the
> heap
> > > > for the
> > > > > > > > region
> > > > > > > > > > > > > servers. You
> > > > > > > > > > > > > > > > > can also use block encoding (FAST_DIFF) to
> try to
> > > > > > make
> > > > > > > > sure
> > > > > > > > > > the
> > > > > > > > > > > > > entire
> > > > > > > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > - Also try to reduce the block size -
> although
> > > > if
> > > > > > your
> > > > > > > > > > overall
> > > > > > > > > > > > > working
> > > > > > > > > > > > > > > > > set does not fit in the heap it won't help
> much.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This is a good section of the book to
> read
> > > > through
> > > > > > > > > > generally
> > > > > > > > > > > > > (even
> > > > > > > > > > > > > > > > > though you might know most of this
> already):
> > > > > > > > > > > > > > > > >
> > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
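For completeness, a couple of the suggestions above usually translate into settings like the following; the paths and names here are examples, not taken from Khaled's cluster. Short-circuit reads are enabled in hdfs-site.xml on both the DataNodes and the RegionServer side:

dfs.client.read.shortcircuit=true
dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket

and FAST_DIFF block encoding plus a smaller block size can be applied per column family from the HBase shell, assuming a family named 'f' (the real family name is not shown in the thread):

alter 'delta', {NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOCKSIZE => '8192'}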
> > > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > > From: Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > > > user@hbase.apache.org
> > > > > > >
> > > > > > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I do see a very brief spike in CPU
> > > > (user/system),
> > > > > > but
> > > > > > > > it's
> > > > > > > > > > no
> > > > > > > > > > > > > where
> > > > > > > > > > > > > > > > near
> > > > > > > > > > > > > > > > > 0% idle. It goes from 99% idle down to
> something
> > > > > > like 40%
> > > > > > > > > > idle
> > > > > > > > > > > > for
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > > > or so. The thing to note, this is all on a
> test
> > > > > > cluster,
> > > > > > > > so
> > > > > > > > > > no
> > > > > > > > > > > > > real load.
> > > > > > > > > > > > > > > > > Things are generally idle until i issue
> 2-3 of
> > > > these
> > > > > > > > > > > > > multi-scan-requests
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > render a web page. Then, you see the spike
> in the
> > > > > > cpu and
> > > > > > > > > > some
> > > > > > > > > > > > > activity
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > the network and disk, but nowhere near
> > > > saturation.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > If there are specific tests you'd like
> me to
> > > > do to
> > > > > > > > debug
> > > > > > > > > > this,
> > > > > > > > > > > > > I'd be
> > > > > > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> I still think you're waiting on disk. No
> > > > IOWAIT?
> > > > > > So
> > > > > > > > CPU
> > > > > > > > > > is not
> > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > a lot for IO. No high User/System CPU
> either?
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> If you see a lot of evicted block then
> each
> > > > RPC
> > > > > > has a
> > > > > > > > high
> > > > > > > > > > > > > chance of
> > > > > > > > > > > > > > > > > requiring to bring an entire 64k block in.
> You'll
> > > > > > see bad
> > > > > > > > > > > > > performance
> > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> We might need to trace this specific
> scenario.
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > > > > > >> From: Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > > > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > > > > > user@hbase.apache.org>
> > > > > > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> I've set the heap size to 6GB and I do
> have gc
> > > > > > > > logging. No
> > > > > > > > > > > > long
> > > > > > > > > > > > > pauses
> > > > > > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Other than the discrepancy between
> what's
> > > > > > reported on
> > > > > > > > the
> > > > > > > > > > > > > client and
> > > > > > > > > > > > > > > > > what's reported at the RS, there is also
> the
> > > > issue
> > > > > > of not
> > > > > > > > > > getting
> > > > > > > > > > > > > proper
> > > > > > > > > > > > > > > > > concurrency. So, even if a reverse get
> takes
> > > > 100ms
> > > > > > or so
> > > > > > > > > > (this
> > > > > > > > > > > > has
> > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > > mostly blocking on various things as no
> physical
> > > > > > > > resource is
> > > > > > > > > > > > > contended),
> > > > > > > > > > > > > > > > > then the other gets/scans should be able to
> > > > proceed
> > > > > > in
> > > > > > > > > > parallel,
> > > > > > > > > > > > > so a
> > > > > > > > > > > > > > > > > thousand concurrent gets/scans should
> finish in
> > > > few
> > > > > > > > hundreds
> > > > > > > > > > of
> > > > > > > > > > > > ms
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > many
> > > > > > > > > > > > > > > > > seconds. That's why I thought I'd increase
> the
> > > > > > handlers
> > > > > > > > > > count to
> > > > > > > > > > > > > try to
> > > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > > more concurrency, but it didn't help. So,
> there
> > > > must
> > > > > > be
> > > > > > > > > > something
> > > > > > > > > > > > > else.
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> Do check again on the heap size of the
> region
> > > > > > > > servers.
> > > > > > > > > > The
> > > > > > > > > > > > > default
> > > > > > > > > > > > > > > > > >>> unconfigured size is 1G; too small for
> much
> > > > of
> > > > > > > > anything.
> > > > > > > > > > > > Check
> > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > RS
> > > > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > > >>> -- look for lines produced by the
> > > > JVMPauseMonitor
> > > > > > > > thread.
> > > > > > > > > > > > They
> > > > > > > > > > > > > > > > usually
> > > > > > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > > > > > process-freeze
> > > > > > > > > > events.
> > > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> Get is implemented as a Scan of a
> single
> > > > row, so
> > > > > > a
> > > > > > > > > > reverse
> > > > > > > > > > > > > scan of a
> > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> In practice, I have seen discrepancy
> between
> > > > the
> > > > > > > > > > latencies
> > > > > > > > > > > > > reported
> > > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >>> RS and the latencies experienced by the
> > > > client.
> > > > > > I've
> > > > > > > > not
> > > > > > > > > > > > > investigated
> > > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > > > > > Elmeleegy <
> > > > > > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > > >>>> Yes performance is similar with less
> > > > handlers (I
> > > > > > > > tried
> > > > > > > > > > with
> > > > > > > > > > > > > 100
> > > > > > > > > > > > > > > > > first).
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The
> > > > working
> > > > > > set
> > > > > > > > > > doesn't
> > > > > > > > > > > > > seem to
> > > > > > > > > > > > > > > > > fit in
> > > > > > > > > > > > > > > > > >>>> memory as there are many cache misses.
> > > > However,
> > > > > > > > disk is
> > > > > > > > > > far
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > being a
> > > > > > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I
> also
> > > > > > verified
> > > > > > > > that
> > > > > > > > > > > > > neither the
> > > > > > > > > > > > > > > > > >>>> network nor the CPU of the region
> server or
> > > > the
> > > > > > > > client
> > > > > > > > > > are a
> > > > > > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > > > > > >>>> This leads me to believe that likely
> this
> > > > is a
> > > > > > > > software
> > > > > > > > > > > > > bottleneck,
> > > > > > > > > > > > > > > > > >>>> possibly due to a misconfiguration on
> my
> > > > side. I
> > > > > > > > just
> > > > > > > > > > don't
> > > > > > > > > > > > > know how
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is
> the
> > > > > > individual
> > > > > > > > > > request
> > > > > > > > > > > > > latency
> > > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > >>>> reported by metrics on the region
> server
> > > > (IPC
> > > > > > > > > > > > processCallTime
> > > > > > > > > > > > > vs
> > > > > > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > > > > > >>>> vs what's measured on the client.
> Does this
> > > > > > sound
> > > > > > > > > > right? Any
> > > > > > > > > > > > > ideas
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > > >>>> About this trick with the timestamps
> to be
> > > > able
> > > > > > to
> > > > > > > > do a
> > > > > > > > > > > > > forward
> > > > > > > > > > > > > > > > scan,
> > > > > > > > > > > > > > > > > >>>> thanks for pointing it out. Actually,
> I am
> > > > > > aware of
> > > > > > > > it.
> > > > > > > > > > The
> > > > > > > > > > > > > problem
> > > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > >>>> is, sometimes I want to get the key
> after a
> > > > > > > > particular
> > > > > > > > > > > > > timestamp and
> > > > > > > > > > > > > > > > > >>>> sometimes I want to get the key
> before, so
> > > > just
> > > > > > > > relying
> > > > > > > > > > on
> > > > > > > > > > > > > the key
> > > > > > > > > > > > > > > > > order
> > > > > > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a
> reverse
> > > > get(). I
> > > > > > > > thought
> > > > > > > > > > > > > reverse
> > > > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > > >>>>
> ----------------------------------------
> > > > > > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > > > > > >>>>> is it the same with fewer threads?
> 1500
> > > > handler
> > > > > > > > threads
> > > > > > > > > > > > > seems to
> > > > > > > > > > > > > > > > be a
> > > > > > > > > > > > > > > > > >>>> lot. Typically a good number of
> threads
> > > > depends
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > > (number of
> > > > > > > > > > > > > > > > > >>>> cores, number of spindles, etc). I
> cannot
> > > > think
> > > > > > of
> > > > > > > > any
> > > > > > > > > > type
> > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > scenario
> > > > > > > > > > > > > > > > > >>>> where more than 100 would give any
> > > > improvement.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> How large is the payload per KV
> retrieved
> > > > that
> > > > > > > > way? If
> > > > > > > > > > > > large
> > > > > > > > > > > > > (as
> > > > > > > > > > > > > > > > in a
> > > > > > > > > > > > > > > > > >>>> few 100k) you definitely want to
> lower the
> > > > > > number
> > > > > > > > of the
> > > > > > > > > > > > > handler
> > > > > > > > > > > > > > > > > threads.
> > > > > > > > > > > > > > > > > >>>>> How much heap do you give the region
> > > > server?
> > > > > > Does
> > > > > > > > the
> > > > > > > > > > > > > working set
> > > > > > > > > > > > > > > > fit
> > > > > > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics,
> do you
> > > > > > see the
> > > > > > > > > > > > eviction
> > > > > > > > > > > > > count
> > > > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > > > >>>> up, if so it does not fit into the
> cache).
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> If the working set does not fit into
> the
> > > > cache
> > > > > > > > > > (eviction
> > > > > > > > > > > > > count goes
> > > > > > > > > > > > > > > > > up)
> > > > > > > > > > > > > > > > > >>>> then HBase will need to bring a new
> block in
> > > > > > from
> > > > > > > > disk
> > > > > > > > > > on
> > > > > > > > > > > > > each Get
> > > > > > > > > > > > > > > > > >>>> (assuming the Gets are more or less
> random
> > > > as
> > > > > > far
> > > > > > > > as the
> > > > > > > > > > > > > server is
> > > > > > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > > > > > >>>>> In case you'll benefit from reducing
> the
> > > > HFile
> > > > > > > > block
> > > > > > > > > > size
> > > > > > > > > > > > > (from 64k
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Lastly I don't think we tested the
> > > > performance
> > > > > > of
> > > > > > > > using
> > > > > > > > > > > > > reverse
> > > > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > > > >>>> this way, there is probably room to
> optimize
> > > > > > this.
> > > > > > > > > > > > > > > > > >>>>> Can you restructure your keys to
> allow
> > > > forwards
> > > > > > > > > > scanning?
> > > > > > > > > > > > For
> > > > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > > > >>>> you could store the time as
> MAX_LONG-time.
> > > > Or
> > > > > > you
> > > > > > > > could
> > > > > > > > > > > > > invert all
> > > > > > > > > > > > > > > > > the bits
> > > > > > > > > > > > > > > > > >>>> of the time portion of the key, so
> that it
> > > > sort
> > > > > > the
> > > > > > > > > > other
> > > > > > > > > > > > > way. Then
> > > > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>>
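A sketch of the key layout Lars describes above (store the time component inverted so newer entries sort first). The prefix layout and names are illustrative, since the thread does not show the real schema:

import org.apache.hadoop.hbase.util.Bytes;

public class KeyDesignSketch {
  // Row key = prefix + (Long.MAX_VALUE - timestamp), so newer rows sort before older ones.
  static byte[] rowKey(byte[] prefix, long timestampMillis) {
    return Bytes.add(prefix, Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
  }
}

With that layout, "the newest record at or before time T" becomes an ordinary forward scan starting at rowKey(prefix, T) with caching 1 (or a small scan), taking the first row returned, so the reverse scan path is avoided entirely.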
> > > > > > > > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > > > > > > > >>>>> From: Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > > > > > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <
> > > > > > user@hbase.apache.org
> > > > > > > > >
> > > > > > > > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014
> 12:12 AM
> > > > > > > > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> I am trying to do a scatter/gather
> on hbase
> > > > > > > > (0.98.6.1),
> > > > > > > > > > > > > where I
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > >>>> client reading ~1000 keys from an
> HBase
> > > > table.
> > > > > > These
> > > > > > > > > > keys
> > > > > > > > > > > > > happen to
> > > > > > > > > > > > > > > > > fall on
> > > > > > > > > > > > > > > > > >>>> the same region server. For my reads
> I use
> > > > > > reverse
> > > > > > > > scan
> > > > > > > > > > to
> > > > > > > > > > > > > read each
> > > > > > > > > > > > > > > > > key as
> > > > > > > > > > > > > > > > > >>>> I want the key prior to a specific
> time
> > > > stamp
> > > > > > (time
> > > > > > > > > > stamps
> > > > > > > > > > > > are
> > > > > > > > > > > > > > > > stored
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > >>>> reverse order). I don't believe gets
> can
> > > > > > accomplish
> > > > > > > > > > that,
> > > > > > > > > > > > > right? so
> > > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> I use 2000 reader threads in the
> client
> > > > and on
> > > > > > > > HBase,
> > > > > > > > > > I've
> > > > > > > > > > > > > set
> > > > > > > > > > > > > > > > > >>>> hbase.regionserver.handler.count to
> 1500.
> > > > With
> > > > > > this
> > > > > > > > > > setup,
> > > > > > > > > > > > my
> > > > > > > > > > > > > > > > scatter
> > > > > > > > > > > > > > > > > >>>> gather is very slow and can take up
> to 10s
> > > > in
> > > > > > total.
> > > > > > > > > > Timing
> > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > individual
> > > > > > > > > > > > > > > > > >>>> getScanner(..) call on the client
> side, it
> > > > can
> > > > > > > > easily
> > > > > > > > > > take
> > > > > > > > > > > > few
> > > > > > > > > > > > > > > > > hundreds of
> > > > > > > > > > > > > > > > > >>>> ms. I also got the following metrics
> from
> > > > the
> > > > > > region
> > > > > > > > > > server
> > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > question:
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > > > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> > > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Where "delta" is the name of the
> table I am
> > > > > > > > querying.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> In addition to all this, i monitored
> the
> > > > > > hardware
> > > > > > > > > > resources
> > > > > > > > > > > > > (CPU,
> > > > > > > > > > > > > > > > > disk,
> > > > > > > > > > > > > > > > > >>>> and network) of both the client and
> the
> > > > region
> > > > > > > > server
> > > > > > > > > > and
> > > > > > > > > > > > > nothing
> > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > >>>> anywhere near saturation. So I am
> puzzled by
> > > > > > what's
> > > > > > > > > > going on
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > where this
> > > > > > > > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Few things to note based on the above
> > > > > > measurements:
> > > > > > > > > > both
> > > > > > > > > > > > > medians of
> > > > > > > > > > > > > > > > > IPC
> > > > > > > > > > > > > > > > > >>>> processCallTime and queueCallTime are
> > > > basically
> > > > > > zero
> > > > > > > > > > (ms I
> > > > > > > > > > > > > presume,
> > > > > > > > > > > > > > > > > >>>> right?). However, scanNext_median is
> 101 (ms
> > > > > > too,
> > > > > > > > > > right?). I
> > > > > > > > > > > > > am not
> > > > > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > > > >>>> how this adds up. Also, even though
> the 101
> > > > > > figure
> > > > > > > > seems
> > > > > > > > > > > > > > > > outrageously
> > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > >>>> and I don't know why, still all these
> scans
> > > > > > should
> > > > > > > > be
> > > > > > > > > > > > > happening in
> > > > > > > > > > > > > > > > > >>>> parallel, so the overall call should
> finish
> > > > > > fast,
> > > > > > > > given
> > > > > > > > > > that
> > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > > >>>> resource is contended, right? but
> this is
> > > > not
> > > > > > what's
> > > > > > > > > > > > > happening, so I
> > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > >    - Andy
> > > > > > > > >
> > > > > > > > > Problems worthy of attack prove their worth by hitting
> back. -
> > > > Piet
> > > > > > Hein
> > > > > > > > > (via Tom White)
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Actually, it's very easy to reproduce. I have a simple program that reproduces it when run against a simple toy cluster with one RS. I've shared the program with Ted. I can file a jira ticket about it if you want.
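
For reference, this is roughly the shape of the test program (a minimal sketch only -- the table name "delta", the row-key layout, and the thread count are illustrative, not the exact code I shared):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseGetRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    final HConnection conn = HConnectionManager.createConnection(conf);
    final long ts = System.currentTimeMillis();
    ExecutorService pool = Executors.newFixedThreadPool(1000);
    long start = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) {
      final int id = i;
      pool.submit(new Runnable() {
        public void run() {
          try {
            HTableInterface table = conn.getTable("delta");
            // A "reverse get": start a reversed scan at "<id>/<timestamp>" and take
            // the first row, i.e. the newest entry for this id at or before ts.
            Scan scan = new Scan(Bytes.toBytes(String.format("%06d/%d", id, ts)));
            scan.setReversed(true);
            scan.setCaching(1);
            ResultScanner scanner = table.getScanner(scan);
            Result first = scanner.next();
            if (first == null) {
              System.out.println("no row at or before ts for id " + id);
            }
            scanner.close();
            table.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
    System.out.println("scatter/gather took " + (System.currentTimeMillis() - start) + " ms");
    conn.close();
  }
}
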
Khaled

> Date: Mon, 13 Oct 2014 19:12:13 -0700
> From: larsh@apache.org
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
> 
> This looks like something to track.
> If there's an easy way to reproduce this, could you file a jira with instructions?
> (Although from the conversations here, it looks like it's not trivial to reproduce).
> 
> -- Lars
> 
> 
> 
> ________________________________
>  From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Monday, October 13, 2014 12:13 PM
> Subject: RE: HBase read performance
>  
> 
> I went through a lengthy testing process with Ted. I tried disabling just cache-on-write (COW), and also disabling both COW and the bucket cache (BC). Disabling them makes the problem go away. I think disabling COW in my workload effectively stops exercising the block cache, since my workload is mostly writes, so disabling COW doesn't cause the cached blocks to churn.
> I also tried both overcommitting memory and not overcommitting it. That is irrelevant though: the bucket cache operates on the virtual memory offered to it and is agnostic to the physical memory size. Memory overcommitment may only hurt performance due to swapping by the OS, but it has nothing to do with the bug.
> Khaled
> 
> > Date: Sun, 12 Oct 2014 23:17:24 -0700
> > Subject: Re: HBase read performance
> > From: stack@duboce.net
> > To: user@hbase.apache.org
> > 
> > On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> > 
> > > It goes away, but I think that's because I am basically churning the cache
> > > far less. My workload is mostly writes.
> > >
> > >
> > What exactly did you change?
> > 
> > Did you disable both BC and cacheonwrite or just one of these configs?
> > 
> > Disabling BC and/or cacheonwrite would churn your cache more?
> > 
> > What about the heap config?  Did you over commit mem?  If so, did you try
> > doing a direct memory sizing that was w/i the bounds of physical memory
> > (well within)?  Did you still see exceptions?
> > 
> > St.Ack
> > 
> > 
> > 
> > 
> > > > Date: Sun, 12 Oct 2014 19:43:16 -0700
> > > > Subject: Re: HBase read performance
> > > > From: yuzhihong@gmail.com
> > > > To: user@hbase.apache.org
> > > >
> > > > Hi,
> > > > Can you turn off cacheonwrite and keep bucket cache to see if the problem
> > > > goes away ?
> > > >
> > > > Cheers
> > > >
> > > > On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
> > > > wrote:
> > > >
> > > > > Yes, I can reproduce it with some work.
> > > > > The workload is basically as follows:
> > > > > There are writers streaming writes to a table. Then, there is a reader
> > > > > (invoked via a web interface). The reader does 1000 parallel reverse
> > > > > scans, which all end up hitting the same region in my case. The scans are
> > > > > effectively "gets", as I just need to get one record off of each of them.
> > > > > I just need to do a "reverse" get, which is not supported (would be great
> > > > > to have :)), so I do it via a reverse scan. After a few tries, the reader
> > > > > consistently hits this bug.
> > > > >
> > > > > This happens with these config changes:
> > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > hbase-site:
> > > > >   hbase.bucketcache.ioengine=offheap
> > > > >   hbase.bucketcache.size=4196
> > > > >   hbase.rs.cacheblocksonwrite=true
> > > > >   hfile.block.index.cacheonwrite=true
> > > > >   hfile.block.bloom.cacheonwrite=true
> > > > >
> > > > > Interestingly, without these config changes, I can't reproduce the
> > > problem.
> > > > > Khaled
> > > > >
> > > > >
> > > > > > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > > > > > Subject: Re: HBase read performance
> > > > > > From: stack@duboce.net
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > It looks like we are messing up our positioning math:
> > > > > >
> > > > > >     233            public final Buffer position(int newPosition) {
> > > > > >     234                if ((newPosition > limit) || (newPosition < 0))
> > > > > >     235                    throw new IllegalArgumentException();
> > > > > >
> > > > > > (java.nio.Buffer, lines 233-235:
> > > > > > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233)
> > > > > >
> > > > > >
> > > > > > Is it easy to reproduce Khaled?  Always same region/store or spread
> > > > > > across all reads?
> > > > > >
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> > > >
> > > > > wrote:
> > > > > >
> > > > > > > Andrew thanks. I think this indeed was the problem.
> > > > > > > To get over it, I increased the amount of memory given to the region server
> > > > > > > to avoid IO on reads. I used the configs below. In my experiments, I have
> > > > > > > writers streaming their output to HBase. The reader powers a web page and
> > > > > > > does this scatter/gather, where it reads the 1000 keys written last and
> > > > > > > passes them to the front end. With this workload, I get the exception below
> > > > > > > at the region server. Again, I am using HBase (0.98.6.1). Any help is
> > > > > > > appreciated.
> > > > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > > > hbase-site:
> > > > > > >   hbase.bucketcache.ioengine=offheap
> > > > > > >   hfile.block.cache.size=0.4
> > > > > > >   hbase.bucketcache.size=4196
> > > > > > >   hbase.rs.cacheblocksonwrite=true
> > > > > > >   hfile.block.index.cacheonwrite=true
> > > > > > >   hfile.block.bloom.cacheonwrite=true
> > > > > > >
> > > > > > > 2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> > > > > > > java.lang.IllegalArgumentException
> > > > > > >         at java.nio.Buffer.position(Buffer.java:236)
> > > > > > >         at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > > > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > > > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > > > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > > > > > >         at java.lang.Thread.run(Thread.java:744)
> > > > > > > > From: apurtell@apache.org
> > > > > > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > > > > > Subject: Re: HBase read performance
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > > The cluster has 2 m1.large nodes.
> > > > > > > >
> > > > > > > > That's the problem right there.
> > > > > > > >
> > > > > > > > You need to look at c3.4xlarge or i2 instances as a minimum
> > > > > requirement.
> > > > > > > M1
> > > > > > > > and even M3 instance types have ridiculously poor IO.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
> > > kdiaa@hotmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks Nicolas, Qiang.
> > > > > > > > >
> > > > > > > > > I was able to write a simple program that reproduces the problem on a
> > > > > > > > > tiny HBase cluster on ec2. The cluster has 2 m1.large nodes. One node
> > > > > > > > > runs the master, name node and zookeeper. The other node runs a data
> > > > > > > > > node and a region server, with heap size configured to be 6GB. There,
> > > > > > > > > the 1000 parallel reverse gets (reverse scans) take 7-8 seconds. The
> > > > > > > > > data set is tiny (10M records, each having a small number of bytes).
> > > > > > > > > As I said before, all hardware resources are very idle there.
> > > > > > > > >
> > > > > > > > > Interestingly, running the same workload on my macbook, the 1000
> > > > > > > > > parallel gets take ~200ms on a pseudo-distributed installation.
> > > > > > > > >
> > > > > > > > > Any help to resolve this mystery is highly appreciated.
> > > > > > > > >
> > > > > > > > > P.S. please find my test program attached.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Khaled
> > > > > > > > >
> > > > > > > > > > From: nkeywal@gmail.com
> > > > > > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I haven't seen it mentioned, but if I understand correctly each scan
> > > > > > > > > > returns a single row? If so you should use Scan#setSmall to save some
> > > > > > > > > > rpc calls.
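
A minimal sketch of what that suggestion looks like for one of these single-row lookups (0.98 client API; the table handle, the row-key layout, and the assumption that small scans combine with reversed scans are mine, not from this thread):

  // One "reverse get" issued as a small, reversed, single-row scan.
  HTableInterface table = conn.getTable("delta");        // conn/ts as in the sketch above
  Scan scan = new Scan(Bytes.toBytes("000001/" + ts));   // row at or before timestamp ts
  scan.setReversed(true);  // walk backwards from the start row
  scan.setSmall(true);     // small-scan hint: open/next/close collapsed into fewer RPCs
  scan.setCaching(1);      // only the first row is needed
  ResultScanner scanner = table.getScanner(scan);
  Result row = scanner.next();
  scanner.close();
  table.close();
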
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Nicolas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
> > > tianq01@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > when using separate HConnection instance, both its
> > > > > > > > > > > RpcClient instance(maintain connection to a regionserver)
> > > and
> > > > > > > Registry
> > > > > > > > > > > instance(maintain connection to zookeeper) will be
> > > separate..
> > > > > > > > > > >
> > > > > > > > > > > see
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > > > > > kdiaa@hotmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I tried creating my own HConnections pool to use for my
> > > HBase
> > > > > > > calls,
> > > > > > > > > so
> > > > > > > > > > > > that not all the (2K) threads share the same HConnection.
> > > > > > > However, I
> > > > > > > > > > > could
> > > > > > > > > > > > only have 10 HConnections. Beyond that I get ZK
> > > exceptions,
> > > > > > > please
> > > > > > > > > find
> > > > > > > > > > > it
> > > > > > > > > > > > below. Also, with 10 HConnections, I don't see noticeable
> > > > > > > > > improvement in
> > > > > > > > > > > > performance so far.
> > > > > > > > > > > > 2014-10-05 06:11:26,490 WARN [main]
> > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > > transient
> > > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > > >
> > > > > > > > >
> > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > > /hbase/hbaseid2014-10-05
> > > > > > > > > > > 06:11:26,490
> > > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > > - Sleeping 1000ms before retry #0...2014-10-05
> > > 06:11:27,845
> > > > > WARN
> > > > > > > > > [main]
> > > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > > transient
> > > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > > >
> > > > > > > > >
> > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > > /hbase/hbaseid2014-10-05
> > > > > > > > > > > 06:11:27,849
> > > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > > - Sleeping 2000ms before retry #1...2014-10-05
> > > 06:11:30,405
> > > > > WARN
> > > > > > > > > [main]
> > > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > > transient
> > > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > > >
> > > > > > > > >
> > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > > /hbase/hbaseid2014-10-05
> > > > > > > > > > > 06:11:30,405
> > > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > > - Sleeping 4000ms before retry #2...2014-10-05
> > > 06:11:35,278
> > > > > WARN
> > > > > > > > > [main]
> > > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > > transient
> > > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > > >
> > > > > > > > >
> > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > > /hbase/hbaseid2014-10-05
> > > > > > > > > > > 06:11:35,279
> > > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > > - Sleeping 8000ms before retry #3...2014-10-05
> > > 06:11:44,393
> > > > > WARN
> > > > > > > > > [main]
> > > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > > transient
> > > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > > >
> > > > > > > > >
> > > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > > /hbase/hbaseid2014-10-05
> > > > > > > > > > > 06:11:44,393
> > > > > > > > > > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper
> > > > > exists
> > > > > > > > > failed
> > > > > > > > > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main]
> > > > > > > zookeeper.ZKUtil
> > > > > > > > > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b,
> > > > > quorum=
> > > > > > > > > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set
> > > watcher
> > > > > on
> > > > > > > znode
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > > > > > > > > >
> > > > > > >
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > > > > > at
> > > > > > > > > > > >
> > > > > > >
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > > at
> > > > > > > > > > > >
> > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > > > > > at
> > > > > > > > >
> > > > > org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > > > > > at
> > > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > > > > > > > Method) at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > > > > > at
> > > > > > > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > > > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > > > > > Thanks for all the input.
> > > > > > > > > > > > > Qiang: yes all the threads are in the same client
> > > process
> > > > > > > sharing
> > > > > > > > > the
> > > > > > > > > > > > same connection. And since I don't see hardware
> > > contention,
> > > > > may
> > > > > > > be
> > > > > > > > > there
> > > > > > > > > > > is
> > > > > > > > > > > > contention over this code path. I'll try using many
> > > > > connections
> > > > > > > and
> > > > > > > > > see
> > > > > > > > > > > if
> > > > > > > > > > > > it alleviates the problems and I'll report back.
> > > > > > > > > > > > > Thanks again,Khaled
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > > > > > >
> > > > > > > > >
> > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > > > > > months
> > > > > > > > > > > > > > ago.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > processCallTime comes from RpcServer#call, so it
> > > looks
> > > > > good?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have a suspect:
> > > > > > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > > > > > > if olny 1 process, those threads will share just 1
> > > > > > > connection to
> > > > > > > > > that
> > > > > > > > > > > > > > regionserver, there might be big contention on the
> > > RPC
> > > > > code
> > > > > > > path.
> > > > > > > > > > > > ---for
> > > > > > > > > > > > > > such case, could you try using different connections?
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> > > > > yuzhihong@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Khaled:
> > > > > > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Otherwise capturing stack trace during the period
> > > of
> > > > > > > reverse
> > > > > > > > > scan
> > > > > > > > > > > > would
> > > > > > > > > > > > > > > help.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > > > > > > larsh@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You might have the data in the OS buffer cache,
> > > > > without
> > > > > > > short
> > > > > > > > > > > > circuit
> > > > > > > > > > > > > > > > reading the region server has to request the
> > > block
> > > > > from
> > > > > > > the
> > > > > > > > > data
> > > > > > > > > > > > node
> > > > > > > > > > > > > > > > process, which then reads it from the block
> > > cache.
> > > > > > > > > > > > > > > > That is a few context switches per RPC that do
> > > not
> > > > > show
> > > > > > > up
> > > > > > > > > in CPU
> > > > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If - as you say - you see a lot of evicted
> > > blocks the
> > > > > > > data
> > > > > > > > > *has*
> > > > > > > > > > > > to come
> > > > > > > > > > > > > > > > from the OS. If you do not see disk IO is *has*
> > > to
> > > > > come
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > OS
> > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > > I.e. there's more RAM on your boxes, and you
> > > should
> > > > > > > increase
> > > > > > > > > the
> > > > > > > > > > > > heap
> > > > > > > > > > > > > > > block
> > > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You can measure the context switches with vmstat.
> > > > > Other
> > > > > > > than
> > > > > > > > > that
> > > > > > > > > > > > I have
> > > > > > > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > > > > > > Also check the data locality index of the region
> > > > > server
> > > > > > > it
> > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > close
> > > > > > > > > > > > > > > > to 100%.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > > user@hbase.apache.org>
> > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll make
> > > sure
> > > > > I
> > > > > > > cover
> > > > > > > > > all
> > > > > > > > > > > > of them
> > > > > > > > > > > > > > > > and get back to you. I am not sure they are the
> > > > > > > bottleneck
> > > > > > > > > though
> > > > > > > > > > > > as they
> > > > > > > > > > > > > > > > all are about optimizing physical resource
> > > usage. As
> > > > > I
> > > > > > > said,
> > > > > > > > > I
> > > > > > > > > > > > don't see
> > > > > > > > > > > > > > > > any contended physical resources now. I'll also
> > > try
> > > > > to
> > > > > > > > > reproduce
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > > problem in a simpler environment and pass to you
> > > the
> > > > > test
> > > > > > > > > program
> > > > > > > > > > > > to play
> > > > > > > > > > > > > > > > with.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Couple of high level points to make. You are
> > > right
> > > > > that
> > > > > > > my
> > > > > > > > > use
> > > > > > > > > > > > case is
> > > > > > > > > > > > > > > > kind of a worst case for HBase reads. But, if
> > > things
> > > > > go
> > > > > > > the
> > > > > > > > > way
> > > > > > > > > > > you
> > > > > > > > > > > > > > > > described them, there should be tons of disk IO
> > > and
> > > > > that
> > > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > > > > clearly
> > > > > > > > > > > > > > > > the bottleneck. This is not the case though.
> > > That's
> > > > > for
> > > > > > > the
> > > > > > > > > > > simple
> > > > > > > > > > > > reason
> > > > > > > > > > > > > > > > that this is done in a test environment (I am
> > > still
> > > > > > > > > prototyping),
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > not a
> > > > > > > > > > > > > > > > lot of data is yet written to HBase. However for
> > > the
> > > > > > > real use
> > > > > > > > > > > > case, there
> > > > > > > > > > > > > > > > should writers constantly writing data to HBase
> > > and
> > > > > > > readers
> > > > > > > > > > > > occasionally
> > > > > > > > > > > > > > > > doing this scatter/gather. At steady state,
> > > things
> > > > > should
> > > > > > > > > only
> > > > > > > > > > > get
> > > > > > > > > > > > worse
> > > > > > > > > > > > > > > > and all the issues you mentioned should get far
> > > more
> > > > > > > > > pronounced.
> > > > > > > > > > > > At this
> > > > > > > > > > > > > > > > point, one can try to mitigate it using more
> > > memory
> > > > > or
> > > > > > > so. I
> > > > > > > > > am
> > > > > > > > > > > > not there
> > > > > > > > > > > > > > > > yet as I think I am hitting some software
> > > bottleneck,
> > > > > > > which I
> > > > > > > > > > > > don't know
> > > > > > > > > > > > > > > > how to work around.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > > > > > Any chance that you can provide a minimal test
> > > > > program
> > > > > > > and
> > > > > > > > > > > > instruction
> > > > > > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > One thing to note is that with scanning HBase
> > > > > cannot
> > > > > > > use
> > > > > > > > > bloom
> > > > > > > > > > > > filters
> > > > > > > > > > > > > > > > to rule out HFiles ahead of time, it needs to
> > > look
> > > > > into
> > > > > > > all
> > > > > > > > > of
> > > > > > > > > > > > them.
> > > > > > > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > > > > > > - random reads that do not fit into the block
> > > cache
> > > > > > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > > > > > > - Do you have short circuit reading enabled? If
> > > > > not,
> > > > > > > you
> > > > > > > > > > > should.
> > > > > > > > > > > > > > > > > - Is your table major compacted? That will
> > > reduce
> > > > > the
> > > > > > > > > number of
> > > > > > > > > > > > files
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > look at.
> > > > > > > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> > > > > > > tcpnodelay)?
> > > > > > > > > It
> > > > > > > > > > > > disabled
> > > > > > > > > > > > > > > > by default in HBase, but necessarily in your
> > > install
> > > > > of
> > > > > > > HDFS.
> > > > > > > > > > > > > > > > > - Which version of HDFS are you using as
> > > backing
> > > > > > > > > filesystem?
> > > > > > > > > > > > > > > > > - If your disk is idle, it means the data fits
> > > into
> > > > > > > the OS
> > > > > > > > > > > buffer
> > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > > In turn that means that you increase the heap
> > > for the
> > > > > > > region
> > > > > > > > > > > > servers. You
> > > > > > > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to
> > > > > make
> > > > > > > sure
> > > > > > > > > the
> > > > > > > > > > > > entire
> > > > > > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > - Also try to reduce the block size - although
> > > if
> > > > > your
> > > > > > > > > overall
> > > > > > > > > > > > working
> > > > > > > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is a good section of the book to read
> > > through
> > > > > > > > > generally
> > > > > > > > > > > > (even
> > > > > > > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > > > > > >
> > > > > http://hbase.apache.org/book.html#perf.configurations
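
Roughly what the FAST_DIFF / smaller-block suggestion would look like applied to the table (a sketch only -- the family name "d" is illustrative, and existing HFiles only pick up the new settings as compactions rewrite them):

  HBaseAdmin admin = new HBaseAdmin(conf);
  TableName name = TableName.valueOf("delta");
  HTableDescriptor desc = admin.getTableDescriptor(name);
  HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d"));
  family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // denser blocks, more rows per cached block
  family.setBlocksize(8 * 1024);                            // 8k HFile blocks instead of the 64k default
  admin.disableTable(name);
  admin.modifyColumn(name, family);
  admin.enableTable(name);
  admin.close();
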
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > > user@hbase.apache.org
> > > > > >
> > > > > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I do see a very brief spike in CPU
> > > (user/system),
> > > > > but
> > > > > > > it's
> > > > > > > > > no
> > > > > > > > > > > > where
> > > > > > > > > > > > > > > near
> > > > > > > > > > > > > > > > 0% idle. It goes from 99% idle down to something
> > > > > like 40%
> > > > > > > > > idle
> > > > > > > > > > > for
> > > > > > > > > > > > a
> > > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > > or so. The thing to note, this is all on a test
> > > > > cluster,
> > > > > > > so
> > > > > > > > > no
> > > > > > > > > > > > real load.
> > > > > > > > > > > > > > > > Things are generally idle until i issue 2-3 of
> > > these
> > > > > > > > > > > > multi-scan-requests
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > render a web page. Then, you see the spike in the
> > > > > cpu and
> > > > > > > > > some
> > > > > > > > > > > > activity
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > the network and disk, but nowhere near
> > > saturation.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If there are specific tests you'd like me to
> > > do to
> > > > > > > debug
> > > > > > > > > this,
> > > > > > > > > > > > I'd be
> > > > > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> I still think you're waiting on disk. No
> > > IOWAIT?
> > > > > So
> > > > > > > CPU
> > > > > > > > > is not
> > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> If you see a lot of evicted block then each
> > > RPC
> > > > > has a
> > > > > > > high
> > > > > > > > > > > > chance of
> > > > > > > > > > > > > > > > requiring to bring an entire 64k block in. You'll
> > > > > see bad
> > > > > > > > > > > > performance
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > > > > user@hbase.apache.org>
> > > > > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > > > > > > logging. No
> > > > > > > > > > > long
> > > > > > > > > > > > pauses
> > > > > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Other than the discrepancy between what's
> > > > > reported on
> > > > > > > the
> > > > > > > > > > > > client and
> > > > > > > > > > > > > > > > what's reported at the RS, there is also the
> > > issue
> > > > > of not
> > > > > > > > > getting
> > > > > > > > > > > > proper
> > > > > > > > > > > > > > > > concurrency. So, even if a reverse get takes
> > > 100ms
> > > > > or so
> > > > > > > > > (this
> > > > > > > > > > > has
> > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > mostly blocking on various things as no physical
> > > > > > > resource is
> > > > > > > > > > > > contended),
> > > > > > > > > > > > > > > > then the other gets/scans should be able to
> > > proceed
> > > > > in
> > > > > > > > > parallel,
> > > > > > > > > > > > so a
> > > > > > > > > > > > > > > > thousand concurrent gets/scans should finish in
> > > few
> > > > > > > hundreds
> > > > > > > > > of
> > > > > > > > > > > ms
> > > > > > > > > > > > not
> > > > > > > > > > > > > > > many
> > > > > > > > > > > > > > > > seconds. That's why I thought I'd increase the
> > > > > handlers
> > > > > > > > > count to
> > > > > > > > > > > > try to
> > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > more concurrency, but it didn't help. So, there
> > > must
> > > > > be
> > > > > > > > > something
> > > > > > > > > > > > else.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> Do check again on the heap size of the region
> > > > > > > servers.
> > > > > > > > > The
> > > > > > > > > > > > default
> > > > > > > > > > > > > > > > >>> unconfigured size is 1G; too small for much
> > > of
> > > > > > > anything.
> > > > > > > > > > > Check
> > > > > > > > > > > > your
> > > > > > > > > > > > > > > RS
> > > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > >>> -- look for lines produced by the
> > > JVMPauseMonitor
> > > > > > > thread.
> > > > > > > > > > > They
> > > > > > > > > > > > > > > usually
> > > > > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > > > > process-freeze
> > > > > > > > > events.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> Get is implemented as a Scan of a single
> > > row, so
> > > > > a
> > > > > > > > > reverse
> > > > > > > > > > > > scan of a
> > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> In practice, I have seen discrepancy between
> > > the
> > > > > > > > > latencies
> > > > > > > > > > > > reported
> > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >>> RS and the latencies experienced by the
> > > client.
> > > > > I've
> > > > > > > not
> > > > > > > > > > > > investigated
> > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > > > > Elmeleegy <
> > > > > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> Yes performance is similar with less
> > > handlers (I
> > > > > > > tried
> > > > > > > > > with
> > > > > > > > > > > > 100
> > > > > > > > > > > > > > > > first).
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The
> > > working
> > > > > set
> > > > > > > > > doesn't
> > > > > > > > > > > > seem to
> > > > > > > > > > > > > > > > fit in
> > > > > > > > > > > > > > > > >>>> memory as there are many cache misses.
> > > However,
> > > > > > > disk is
> > > > > > > > > far
> > > > > > > > > > > > from
> > > > > > > > > > > > > > > > being a
> > > > > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also
> > > > > verified
> > > > > > > that
> > > > > > > > > > > > neither the
> > > > > > > > > > > > > > > > >>>> network nor the CPU of the region server or
> > > the
> > > > > > > client
> > > > > > > > > are a
> > > > > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > > > > >>>> This leads me to believe that likely this
> > > is a
> > > > > > > software
> > > > > > > > > > > > bottleneck,
> > > > > > > > > > > > > > > > >>>> possibly due to a misconfiguration on my
> > > side. I
> > > > > > > just
> > > > > > > > > don't
> > > > > > > > > > > > know how
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is the
> > > > > individual
> > > > > > > > > request
> > > > > > > > > > > > latency
> > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > >>>> reported by metrics on the region server
> > > (IPC
> > > > > > > > > > > processCallTime
> > > > > > > > > > > > vs
> > > > > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > > > > >>>> vs what's measured on the client. Does this
> > > > > sound
> > > > > > > > > right? Any
> > > > > > > > > > > > ideas
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> About this trick with the timestamps to be
> > > able
> > > > > to
> > > > > > > do a
> > > > > > > > > > > > forward
> > > > > > > > > > > > > > > scan,
> > > > > > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am
> > > > > aware of
> > > > > > > it.
> > > > > > > > > The
> > > > > > > > > > > > problem
> > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > > > > > > particular
> > > > > > > > > > > > timestamp and
> > > > > > > > > > > > > > > > >>>> sometimes I want to get the key before, so
> > > just
> > > > > > > relying
> > > > > > > > > on
> > > > > > > > > > > > the key
> > > > > > > > > > > > > > > > order
> > > > > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse
> > > get(). I
> > > > > > > thought
> > > > > > > > > > > > reverse
> > > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500
> > > handler
> > > > > > > threads
> > > > > > > > > > > > seems to
> > > > > > > > > > > > > > > be a
> > > > > > > > > > > > > > > > >>>> lot. Typically a good number of threads
> > > depends
> > > > > on
> > > > > > > the
> > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > (number of
> > > > > > > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot
> > > think
> > > > > of
> > > > > > > any
> > > > > > > > > type
> > > > > > > > > > > of
> > > > > > > > > > > > > > > > scenario
> > > > > > > > > > > > > > > > >>>> where more than 100 would give any
> > > improvement.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> How large is the payload per KV retrieved
> > > that
> > > > > > > way? If
> > > > > > > > > > > large
> > > > > > > > > > > > (as
> > > > > > > > > > > > > > > in a
> > > > > > > > > > > > > > > > >>>> few 100k) you definitely want to lower the
> > > > > number
> > > > > > > of the
> > > > > > > > > > > > handler
> > > > > > > > > > > > > > > > threads.
> > > > > > > > > > > > > > > > >>>>> How much heap do you give the region
> > > server?
> > > > > Does
> > > > > > > the
> > > > > > > > > > > > working set
> > > > > > > > > > > > > > > fit
> > > > > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you
> > > > > see the
> > > > > > > > > > > eviction
> > > > > > > > > > > > count
> > > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> If the working set does not fit into the
> > > cache
> > > > > > > > > (eviction
> > > > > > > > > > > > count goes
> > > > > > > > > > > > > > > > up)
> > > > > > > > > > > > > > > > >>>> then HBase will need to bring a new block in
> > > > > from
> > > > > > > disk
> > > > > > > > > on
> > > > > > > > > > > > each Get
> > > > > > > > > > > > > > > > >>>> (assuming the Gets are more or less random
> > > as
> > > > > far
> > > > > > > as the
> > > > > > > > > > > > server is
> > > > > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > > > > >>>>> In case you'll benefit from reducing the
> > > HFile
> > > > > > > block
> > > > > > > > > size
> > > > > > > > > > > > (from 64k
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Lastly I don't think we tested the
> > > performance
> > > > > of
> > > > > > > using
> > > > > > > > > > > > reverse
> > > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > > >>>> this way, there is probably room to optimize
> > > > > this.
> > > > > > > > > > > > > > > > >>>>> Can you restructure your keys to allow
> > > forwards
> > > > > > > > > scanning?
> > > > > > > > > > > For
> > > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > > >>>> you could store the time as MAX_LONG-time.
> > > Or
> > > > > you
> > > > > > > could
> > > > > > > > > > > > invert all
> > > > > > > > > > > > > > > > the bits
> > > > > > > > > > > > > > > > >>>> of the time portion of the key, so that it
> > > sort
> > > > > the
> > > > > > > > > other
> > > > > > > > > > > > way. Then
> > > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > >>>> could do a forward scan.
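
The key layout that suggestion implies would look roughly like this (a sketch; the id/timestamp split is illustrative):

  // Row key = id + (Long.MAX_VALUE - timestamp). Newer entries sort first, so a plain
  // forward scan starting at rowKeyFor(id, t) returns the newest entry for that id at
  // or before time t -- no reverse scan needed (bound the scan with an id stop row so
  // it cannot run into the next id).
  byte[] rowKeyFor(String id, long timestampMillis) {
    return Bytes.add(Bytes.toBytes(id), Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
  }
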
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <
> > > > > user@hbase.apache.org
> > > > > > > >
> > > > > > > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> > > > > > > (0.98.6.1),
> > > > > > > > > > > > where I
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > >>>> client reading ~1000 keys from an HBase
> > > table.
> > > > > These
> > > > > > > > > keys
> > > > > > > > > > > > happen to
> > > > > > > > > > > > > > > > fall on
> > > > > > > > > > > > > > > > >>>> the same region server. For my reads I use
> > > > > reverse
> > > > > > > scan
> > > > > > > > > to
> > > > > > > > > > > > read each
> > > > > > > > > > > > > > > > key as
> > > > > > > > > > > > > > > > >>>> I want the key prior to a specific time
> > > stamp
> > > > > (time
> > > > > > > > > stamps
> > > > > > > > > > > are
> > > > > > > > > > > > > > > stored
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > >>>> reverse order). I don't believe gets can
> > > > > accomplish
> > > > > > > > > that,
> > > > > > > > > > > > right? so
> > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> I use 2000 reader threads in the client
> > > and on
> > > > > > > HBase,
> > > > > > > > > I've
> > > > > > > > > > > > set
> > > > > > > > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500.
> > > With
> > > > > this
> > > > > > > > > setup,
> > > > > > > > > > > my
> > > > > > > > > > > > > > > scatter
> > > > > > > > > > > > > > > > >>>> gather is very slow and can take up to 10s
> > > in
> > > > > total.
> > > > > > > > > Timing
> > > > > > > > > > > an
> > > > > > > > > > > > > > > > individual
> > > > > > > > > > > > > > > > >>>> getScanner(..) call on the client side, it
> > > can
> > > > > > > easily
> > > > > > > > > take
> > > > > > > > > > > few
> > > > > > > > > > > > > > > > hundreds of
> > > > > > > > > > > > > > > > >>>> ms. I also got the following metrics from
> > > the
> > > > > region
> > > > > > > > > server
> > > > > > > > > > > in
> > > > > > > > > > > > > > > > question:
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> > > > > > > 556.9799999999818,
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > > > > > > >>>>> "processCallTime_mean" :
> > > 105.64873440912682,
> > > > > > > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > > > > > > >>>>> "processCallTime_95th_percentile" :
> > > 7917.95,
> > > > > > > > > > > > > > > > >>>>> "processCallTime_99th_percentile" :
> > > 8876.89,
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> > > > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> > > > > > > querying.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> In addition to all this, i monitored the
> > > > > hardware
> > > > > > > > > resources
> > > > > > > > > > > > (CPU,
> > > > > > > > > > > > > > > > disk,
> > > > > > > > > > > > > > > > >>>> and network) of both the client and the
> > > region
> > > > > > > server
> > > > > > > > > and
> > > > > > > > > > > > nothing
> > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by
> > > > > what's
> > > > > > > > > going on
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > where this
> > > > > > > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Few things to note based on the above
> > > > > measurements:
> > > > > > > > > both
> > > > > > > > > > > > medians of
> > > > > > > > > > > > > > > > IPC
> > > > > > > > > > > > > > > > >>>> processCallTime and queueCallTime are
> > > basically
> > > > > zero
> > > > > > > > > (ms I
> > > > > > > > > > > > presume,
> > > > > > > > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms
> > > > > too,
> > > > > > > > > right?). I
> > > > > > > > > > > > am not
> > > > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > > >>>> how this adds up. Also, even though the 101
> > > > > figure
> > > > > > > seems
> > > > > > > > > > > > > > > outrageously
> > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > >>>> and I don't know why, still all these scans
> > > > > should
> > > > > > > be
> > > > > > > > > > > > happening in
> > > > > > > > > > > > > > > > >>>> parallel, so the overall call should finish
> > > > > fast,
> > > > > > > given
> > > > > > > > > that
> > > > > > > > > > > > no
> > > > > > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > >>>> resource is contended, right? but this is
> > > not
> > > > > what's
> > > > > > > > > > > > happening, so I
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > >    - Andy
> > > > > > > >
> > > > > > > > Problems worthy of attack prove their worth by hitting back. -
> > > Piet
> > > > > Hein
> > > > > > > > (via Tom White)
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
 		 	   		  

Re: HBase read performance

Posted by lars hofhansl <la...@apache.org>.
This looks like something to track.
If there's an easy way to reproduce this, could you file a jira with instructions?
(Although from the conversations here, it looks like it's not trivial to reproduce).

-- Lars



________________________________
 From: Khaled Elmeleegy <kd...@hotmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Monday, October 13, 2014 12:13 PM
Subject: RE: HBase read performance
 

I went through a lengthy testing process with Ted. I tried disabling just cache-on-write (COW), and also disabling both COW and the bucket cache (BC). Disabling them makes the problem go away. I think disabling COW in my workload effectively stops exercising the block cache: the workload is mostly writes, so without COW the cached blocks don't churn.
I also tried both overcommitting memory and not overcommitting; that turned out to be irrelevant. The cache operates on the virtual memory offered to it and is agnostic to the physical memory size. Memory overcommitment may only hurt performance through OS swapping, but it has nothing to do with this bug.
Khaled
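
A minimal sketch of wiring those switches into a reproduction test, in case it helps with the JIRA Lars asks for above. The property names are the ones quoted later in this thread; the HBaseTestingUtility scaffolding is illustrative and is not the test program attached elsewhere in the thread:

    import org.apache.hadoop.hbase.HBaseTestingUtility;

    public class CacheOnWriteRepro {
      public static void main(String[] args) throws Exception {
        HBaseTestingUtility util = new HBaseTestingUtility();
        // Settings that correlate with the failure in this thread. The region server
        // JVM also needs direct memory (-XX:MaxDirectMemorySize) for the off-heap cache.
        util.getConfiguration().set("hbase.bucketcache.ioengine", "offheap");
        util.getConfiguration().setInt("hbase.bucketcache.size", 4196);            // MB
        util.getConfiguration().setBoolean("hbase.rs.cacheblocksonwrite", true);
        util.getConfiguration().setBoolean("hfile.block.index.cacheonwrite", true);
        util.getConfiguration().setBoolean("hfile.block.bloom.cacheonwrite", true);
        util.startMiniCluster();
        // ... stream writes, then issue the parallel one-row reverse scans described below ...
      }
    }

Flipping hbase.rs.cacheblocksonwrite and the two cacheonwrite flags to false gives the variant reported above as making the problem disappear.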

> Date: Sun, 12 Oct 2014 23:17:24 -0700
> Subject: Re: HBase read performance
> From: stack@duboce.net
> To: user@hbase.apache.org
> 
> On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> 
> > It goes away, but I think that's because I am basically churning the cache
> > far less. My workload is mostly writes.
> >
> >
> What exactly did you change?
> 
> Did you disable both BC and cacheonwrite or just one of these configs?
> 
> Disabling BC and/or cacheonwrite would churn your cache more?
> 
> What about the heap config?  Did you over commit mem?  If so, did you try
> doing a direct memory sizing that was w/i the bounds of physical memory
> (well within)?  Did you still see exceptions?
> 
> St.Ack
> 
> 
> 
> 
> > > Date: Sun, 12 Oct 2014 19:43:16 -0700
> > > Subject: Re: HBase read performance
> > > From: yuzhihong@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi,
> > > Can you turn off cacheonwrite and keep bucket cache to see if the problem
> > > goes away ?
> > >
> > > Cheers
> > >
> > > On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
> > > wrote:
> > >
> > > > Yes, I can reproduce it with some work.
> > > > The workload is basically as follows:
> > > > There are writers streaming writes to a table. Then, there is a reader
> > > > (invoked via a web interface). The reader does a 1000 parallel reverse
> > > > scans, all end up hitting the same region in my case. The scans are
> > > > effectively "gets" as I just need to get one record off of each of
> > them. I
> > > > just need to do a "reverse" get, which is not supported (would be
> > great to
> > > > have :)), so I do it via reverse scan. After few tries, the reader
> > > > consistently hits this bug.
> > > >
> > > > This happens with these config changes:
> > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > hbase-site:
> > > > hbase.bucketcache.ioengine=offheap
> > > > hbase.bucketcache.size=4196
> > > > hbase.rs.cacheblocksonwrite=true
> > > > hfile.block.index.cacheonwrite=true
> > > > hfile.block.bloom.cacheonwrite=true
> > > >
> > > > Interestingly, without these config changes, I can't reproduce the
> > problem.
> > > > Khaled
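
A rough sketch of the reader side just described -- a scatter/gather of ~1000 one-row reverse scans sharing one connection. The table name "delta" is from this thread; the keys, pool size and class name are made up, and the client calls are the stock 0.98 ones:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScatterGather {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        final HConnection conn = HConnectionManager.createConnection(conf);
        ExecutorService pool = Executors.newFixedThreadPool(64);
        List<Future<Result>> futures = new ArrayList<Future<Result>>();
        for (int i = 0; i < 1000; i++) {
          final byte[] startRow = Bytes.toBytes("key-" + i);    // made-up keys
          futures.add(pool.submit(new Callable<Result>() {
            public Result call() throws Exception {
              HTableInterface table = conn.getTable("delta");
              try {
                Scan scan = new Scan(startRow);
                scan.setReversed(true);   // walk backwards from startRow
                scan.setCaching(1);       // only one row is wanted
                ResultScanner rs = table.getScanner(scan);
                try {
                  return rs.next();       // the "reverse get"
                } finally {
                  rs.close();
                }
              } finally {
                table.close();
              }
            }
          }));
        }
        for (Future<Result> f : futures) {
          f.get();                        // gather
        }
        pool.shutdown();
        conn.close();
      }
    }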
> > > >
> > > >
> > > > > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > > > > Subject: Re: HBase read performance
> > > > > From: stack@duboce.net
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > It looks like we are messing up our positioning math. The check that
> > > > > throws is java.nio.Buffer.position() (OpenJDK 6-b14, Buffer.java lines 233-235,
> > > > > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233 ):
> > > > >
> > > > >     public final Buffer position(int newPosition) {
> > > > >         if ((newPosition > limit) || (newPosition < 0))
> > > > >             throw new IllegalArgumentException();
> > > > >
> > > > >
> > > > > Is it easy to reproduce Khaled?  Always same region/store or spread
> > > > > across all reads?
> > > > >
> > > > >
> > > > > St.Ack
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> > >
> > > > wrote:
> > > > >
> > > > > > Andrew thanks. I think this indeed was the problem.
> > > > > > To get over it, I increased the amount of memory given the region
> > > > server
> > > > > > to avoid IO on reads. I used the configs below. In my experiments,
> > I
> > > > have
> > > > > > writers streaming their output to HBase. The reader powers a web
> > page
> > > > and
> > > > > > does this scatter/gather, where it reads 1000 keys written last and
> > > > passes
> > > > > > them the the front end. With this workload, I get the exception
> > below
> > > > at
> > > > > > the region server. Again, I am using HBAse (0.98.6.1). Any help is
> > > > > > appreciated.
> > > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > > > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > > > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > > hbase-site:
> > > > > > hbase.bucketcache.ioengine=offheap
> > > > > > hfile.block.cache.size=0.4
> > > > > > hbase.bucketcache.size=4196
> > > > > > hbase.rs.cacheblocksonwrite=true
> > > > > > hfile.block.index.cacheonwrite=true
> > > > > > hfile.block.bloom.cacheonwrite=true
> > > > > >
> > > > > > 2014-10-10 15:06:44,173 ERROR
> > > > > > [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer:
> > > > > > Unexpected throwable object
> > > > > > java.lang.IllegalArgumentException
> > > > > >         at java.nio.Buffer.position(Buffer.java:236)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > > > > >         at
> > > > org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > > > > >         at
> > > > org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > > > > >         at
> > > > > >
> > > >
> > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > > > > >         at
> > > > > > org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > > > > >         at java.lang.Thread.run(Thread.java:744)
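
For what it's worth, the IllegalArgumentException at the top of that trace is just java.nio.Buffer's bounds check rejecting an out-of-range position (ByteBufferUtils.skip() boils down to buf.position(buf.position() + length)). A tiny standalone illustration with a made-up length, nothing HBase-specific:

    import java.nio.ByteBuffer;

    public class PositionOverrun {
      public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(64 * 1024);      // stand-in for an HFile block
        int bogusKeyValueLen = Integer.MAX_VALUE;               // a corrupted length
        block.position(block.position() + bogusKeyValueLen);    // throws IllegalArgumentException
      }
    }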
> > > > > > > From: apurtell@apache.org
> > > > > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > > > > Subject: Re: HBase read performance
> > > > > > > To: user@hbase.apache.org
> > > > > > >
> > > > > > > > The cluster has 2 m1.large nodes.
> > > > > > >
> > > > > > > That's the problem right there.
> > > > > > >
> > > > > > > You need to look at c3.4xlarge or i2 instances as a minimum
> > > > requirement.
> > > > > > M1
> > > > > > > and even M3 instance types have ridiculously poor IO.
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
> > kdiaa@hotmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Nicolas, Qiang.
> > > > > > > >
> > > > > > > > I was able to write a simple program that reproduces the
> > problem
> > > > on a
> > > > > > tiny
> > > > > > > > HBase cluster on ec2. The cluster has 2 m1.large nodes. One
> > node
> > > > runs
> > > > > > the
> > > > > > > > master, name node and zookeeper. The other node runs a data
> > node
> > > > and a
> > > > > > > > region server, with heap size configured to be 6GB. There, the
> > 1000
> > > > > > > > parallel reverse gets (reverse scans) take 7-8 seconds. The
> > data
> > > > set is
> > > > > > > > tiny (10M records, each having a small number of bytes). As I
> > said
> > > > > > before,
> > > > > > > > all hardware resources are very idle there.
> > > > > > > >
> > > > > > > > Interestingly, running the same workload on my macbook, the
> > 1000
> > > > > > parallel
> > > > > > > > gets take ~200ms on a pseudo-distributed installation.
> > > > > > > >
> > > > > > > > Any help to resolve this mystery is highly appreciated.
> > > > > > > >
> > > > > > > > P.S. please find my test program attached.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Khaled
> > > > > > > >
> > > > > > > > > From: nkeywal@gmail.com
> > > > > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I haven't seen it mentioned, but if I understand correctly
> > each
> > > > scan
> > > > > > > > > returns a single row? If so you should use Scan#setSmall to
> > save
> > > > > > some rpc
> > > > > > > > > calls.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Nicolas
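
A reverse get along those lines, with the small-scan flag added, might look like the helper below (a sketch; the class and method names are made up). With setSmall(true) the open/next/close round trips collapse into a single RPC when the wanted row sits in one data block:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ReverseGet {
      public static Result reverseGet(HTableInterface table, byte[] startRow) throws IOException {
        Scan scan = new Scan(startRow);
        scan.setReversed(true);   // newest-first key layout, walk backwards
        scan.setSmall(true);      // single-RPC small scan
        scan.setCaching(1);
        ResultScanner rs = table.getScanner(scan);
        try {
          return rs.next();
        } finally {
          rs.close();
        }
      }
    }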
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
> > tianq01@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > when using separate HConnection instance, both its
> > > > > > > > > > RpcClient instance(maintain connection to a regionserver)
> > and
> > > > > > Registry
> > > > > > > > > > instance(maintain connection to zookeeper) will be
> > separate..
> > > > > > > > > >
> > > > > > > > > > see
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > > > > kdiaa@hotmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I tried creating my own HConnections pool to use for my
> > HBase
> > > > > > calls,
> > > > > > > > so
> > > > > > > > > > > that not all the (2K) threads share the same HConnection.
> > > > > > However, I
> > > > > > > > > > could
> > > > > > > > > > > only have 10 HConnections. Beyond that I get ZK
> > exceptions,
> > > > > > please
> > > > > > > > find
> > > > > > > > > > it
> > > > > > > > > > > below. Also, with 10 HConnections, I don't see noticeable
> > > > > > > > improvement in
> > > > > > > > > > > performance so far.
> > > > > > > > > > > 2014-10-05 06:11:26,490 WARN [main]
> > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > transient
> > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > >
> > > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > /hbase/hbaseid2014-10-05
> > > > > > > > > > 06:11:26,490
> > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > - Sleeping 1000ms before retry #0...2014-10-05
> > 06:11:27,845
> > > > WARN
> > > > > > > > [main]
> > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > transient
> > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > >
> > > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > /hbase/hbaseid2014-10-05
> > > > > > > > > > 06:11:27,849
> > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > - Sleeping 2000ms before retry #1...2014-10-05
> > 06:11:30,405
> > > > WARN
> > > > > > > > [main]
> > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > transient
> > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > >
> > > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > /hbase/hbaseid2014-10-05
> > > > > > > > > > 06:11:30,405
> > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > - Sleeping 4000ms before retry #2...2014-10-05
> > 06:11:35,278
> > > > WARN
> > > > > > > > [main]
> > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > transient
> > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > >
> > > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > /hbase/hbaseid2014-10-05
> > > > > > > > > > 06:11:35,279
> > > > > > > > > > > INFO [main] util.RetryCounter
> > > > > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > > > > - Sleeping 8000ms before retry #3...2014-10-05
> > 06:11:44,393
> > > > WARN
> > > > > > > > [main]
> > > > > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > > > > transient
> > > > > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > > > > >
> > > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > KeeperErrorCode = ConnectionLoss for
> > /hbase/hbaseid2014-10-05
> > > > > > > > > > 06:11:44,393
> > > > > > > > > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper
> > > > exists
> > > > > > > > failed
> > > > > > > > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main]
> > > > > > zookeeper.ZKUtil
> > > > > > > > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b,
> > > > quorum=
> > > > > > > > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set
> > watcher
> > > > on
> > > > > > znode
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > > > > > > > >
> > > > > >
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > > > > at
> > > > > > > > > > >
> > > > > >
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > at
> > > > > > > > > > >
> > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > > > > at
> > > > > > > >
> > > > org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > > > > at
> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > > > > > > Method) at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > > > > at
> > > > > > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
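
A sketch of such a pool (sizes and names made up). Each createConnection() holds its own RPC client and its own ZooKeeper session, so the quorum must accept that many sessions from one host; the ConnectionLoss errors above appearing once the pool grows past ten would be consistent with a low per-host connection cap on the ZooKeeper side (maxClientCnxns, which the link from Qiang quoted below is about):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;

    public class ConnectionPool {
      public static HConnection[] create(int size) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HConnection[] pool = new HConnection[size];
        for (int i = 0; i < size; i++) {
          // each instance gets its own RPC client and its own ZooKeeper session
          pool[i] = HConnectionManager.createConnection(conf);
        }
        return pool;   // hand out pool[threadId % size] to the reader threads
      }
    }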
> > > > > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > > > > >
> > > > > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > > > > Thanks for all the input.
> > > > > > > > > > > > Qiang: yes all the threads are in the same client
> > process
> > > > > > sharing
> > > > > > > > the
> > > > > > > > > > > same connection. And since I don't see hardware
> > contention,
> > > > may
> > > > > > be
> > > > > > > > there
> > > > > > > > > > is
> > > > > > > > > > > contention over this code path. I'll try using many
> > > > connections
> > > > > > and
> > > > > > > > see
> > > > > > > > > > if
> > > > > > > > > > > it alleviates the problems and I'll report back.
> > > > > > > > > > > > Thanks again,Khaled
> > > > > > > > > > > >
> > > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > > > > >
> > > > > > > >
> > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > > > > months
> > > > > > > > > > > > > ago.
> > > > > > > > > > > > >
> > > > > > > > > > > > > processCallTime comes from RpcServer#call, so it
> > looks
> > > > good?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have a suspect:
> > > > > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > > > > >
> > > > > > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > > > > > if only 1 process, those threads will share just 1
> > > > > > connection to
> > > > > > > > that
> > > > > > > > > > > > > regionserver, there might be big contention on the
> > RPC
> > > > code
> > > > > > path.
> > > > > > > > > > > ---for
> > > > > > > > > > > > > such case, could you try using different connections?
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> > > > yuzhihong@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Khaled:
> > > > > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Otherwise capturing stack trace during the period
> > of
> > > > > > reverse
> > > > > > > > scan
> > > > > > > > > > > would
> > > > > > > > > > > > > > help.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > > > > > larsh@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You might have the data in the OS buffer cache,
> > > > without
> > > > > > short
> > > > > > > > > > > circuit
> > > > > > > > > > > > > > > reading the region server has to request the
> > block
> > > > from
> > > > > > the
> > > > > > > > data
> > > > > > > > > > > node
> > > > > > > > > > > > > > > process, which then reads it from the block
> > cache.
> > > > > > > > > > > > > > > That is a few context switches per RPC that do
> > not
> > > > show
> > > > > > up
> > > > > > > > in CPU
> > > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If - as you say - you see a lot of evicted
> > blocks the
> > > > > > data
> > > > > > > > *has*
> > > > > > > > > > > to come
> > > > > > > > > > > > > > > from the OS. If you do not see disk IO is *has*
> > to
> > > > come
> > > > > > from
> > > > > > > > the
> > > > > > > > > > OS
> > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > I.e. there's more RAM on your boxes, and you
> > should
> > > > > > increase
> > > > > > > > the
> > > > > > > > > > > heap
> > > > > > > > > > > > > > block
> > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You can measure the context switches with vmstat.
> > > > Other
> > > > > > than
> > > > > > > > that
> > > > > > > > > > > I have
> > > > > > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > > > > > Also check the data locality index of the region
> > > > server
> > > > > > it
> > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > > > close
> > > > > > > > > > > > > > > to 100%.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > user@hbase.apache.org>
> > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll make
> > sure
> > > > I
> > > > > > cover
> > > > > > > > all
> > > > > > > > > > > of them
> > > > > > > > > > > > > > > and get back to you. I am not sure they are the
> > > > > > bottleneck
> > > > > > > > though
> > > > > > > > > > > as they
> > > > > > > > > > > > > > > all are about optimizing physical resource
> > usage. As
> > > > I
> > > > > > said,
> > > > > > > > I
> > > > > > > > > > > don't see
> > > > > > > > > > > > > > > any contended physical resources now. I'll also
> > try
> > > > to
> > > > > > > > reproduce
> > > > > > > > > > > this
> > > > > > > > > > > > > > > problem in a simpler environment and pass to you
> > the
> > > > test
> > > > > > > > program
> > > > > > > > > > > to play
> > > > > > > > > > > > > > > with.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Couple of high level points to make. You are
> > right
> > > > that
> > > > > > my
> > > > > > > > use
> > > > > > > > > > > case is
> > > > > > > > > > > > > > > kind of a worst case for HBase reads. But, if
> > things
> > > > go
> > > > > > the
> > > > > > > > way
> > > > > > > > > > you
> > > > > > > > > > > > > > > described them, there should be tons of disk IO
> > and
> > > > that
> > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > > > > > clearly
> > > > > > > > > > > > > > > the bottleneck. This is not the case though.
> > That's
> > > > for
> > > > > > the
> > > > > > > > > > simple
> > > > > > > > > > > reason
> > > > > > > > > > > > > > > that this is done in a test environment (I am
> > still
> > > > > > > > prototyping),
> > > > > > > > > > > and
> > > > > > > > > > > > > > not a
> > > > > > > > > > > > > > > lot of data is yet written to HBase. However for
> > the
> > > > > > real use
> > > > > > > > > > > case, there
> > > > > > > > > > > > > > > should writers constantly writing data to HBase
> > and
> > > > > > readers
> > > > > > > > > > > occasionally
> > > > > > > > > > > > > > > doing this scatter/gather. At steady state,
> > things
> > > > should
> > > > > > > > only
> > > > > > > > > > get
> > > > > > > > > > > worse
> > > > > > > > > > > > > > > and all the issues you mentioned should get far
> > more
> > > > > > > > pronounced.
> > > > > > > > > > > At this
> > > > > > > > > > > > > > > point, one can try to mitigate it using more
> > memory
> > > > or
> > > > > > so. I
> > > > > > > > am
> > > > > > > > > > > not there
> > > > > > > > > > > > > > > yet as I think I am hitting some software
> > bottleneck,
> > > > > > which I
> > > > > > > > > > > don't know
> > > > > > > > > > > > > > > how to work around.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > > > > Any chance that you can provide a minimal test
> > > > program
> > > > > > and
> > > > > > > > > > > instruction
> > > > > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > One thing to note is that with scanning HBase
> > > > cannot
> > > > > > use
> > > > > > > > bloom
> > > > > > > > > > > filters
> > > > > > > > > > > > > > > to rule out HFiles ahead of time, it needs to
> > look
> > > > into
> > > > > > all
> > > > > > > > of
> > > > > > > > > > > them.
> > > > > > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > > > > > - random reads that do not fit into the block
> > cache
> > > > > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > > > > > - Do you have short circuit reading enabled? If
> > > > not,
> > > > > > you
> > > > > > > > > > should.
> > > > > > > > > > > > > > > > - Is your table major compacted? That will
> > reduce
> > > > the
> > > > > > > > number of
> > > > > > > > > > > files
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > look at.
> > > > > > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> > > > > > tcpnodelay)?
> > > > > > > > It
> > > > > > > > > > > disabled
> > > > > > > > > > > > > > > by default in HBase, but necessarily in your
> > install
> > > > of
> > > > > > HDFS.
> > > > > > > > > > > > > > > > - Which version of HDFS are you using as
> > backing
> > > > > > > > filesystem?
> > > > > > > > > > > > > > > > - If your disk is idle, it means the data fits
> > into
> > > > > > the OS
> > > > > > > > > > buffer
> > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > In turn that means that you increase the heap
> > for the
> > > > > > region
> > > > > > > > > > > servers. You
> > > > > > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to
> > > > make
> > > > > > sure
> > > > > > > > the
> > > > > > > > > > > entire
> > > > > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - Also try to reduce the block size - although
> > if
> > > > your
> > > > > > > > overall
> > > > > > > > > > > working
> > > > > > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This is a good section of the book to read
> > through
> > > > > > > > generally
> > > > > > > > > > > (even
> > > > > > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > > > > >
> > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -- Lars
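
A sketch of the column-family side of those suggestions (FAST_DIFF encoding plus a smaller block size). The family name is a placeholder, "delta" is the table from this thread, and short-circuit reads are an hdfs-site.xml matter not shown here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class TuneFamily {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HColumnDescriptor cf = new HColumnDescriptor("d");      // placeholder family name
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);   // denser blocks, more working set in cache
        cf.setBlocksize(8 * 1024);                              // down from the 64k default
        // the table may need to be disabled first, depending on online schema settings
        admin.modifyColumn(TableName.valueOf("delta"), cf);
        admin.close();
      }
    }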
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > user@hbase.apache.org
> > > > >
> > > > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I do see a very brief spike in CPU
> > (user/system),
> > > > but
> > > > > > it's
> > > > > > > > no
> > > > > > > > > > > where
> > > > > > > > > > > > > > near
> > > > > > > > > > > > > > > 0% idle. It goes from 99% idle down to something
> > > > like 40%
> > > > > > > > idle
> > > > > > > > > > for
> > > > > > > > > > > a
> > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > or so. The thing to note, this is all on a test
> > > > cluster,
> > > > > > so
> > > > > > > > no
> > > > > > > > > > > real load.
> > > > > > > > > > > > > > > Things are generally idle until i issue 2-3 of
> > these
> > > > > > > > > > > multi-scan-requests
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > render a web page. Then, you see the spike in the
> > > > cpu and
> > > > > > > > some
> > > > > > > > > > > activity
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the network and disk, but nowhere near
> > saturation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If there are specific tests you'd like me to
> > do to
> > > > > > debug
> > > > > > > > this,
> > > > > > > > > > > I'd be
> > > > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> I still think you're waiting on disk. No
> > IOWAIT?
> > > > So
> > > > > > CPU
> > > > > > > > is not
> > > > > > > > > > > waiting
> > > > > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> If you see a lot of evicted block then each
> > RPC
> > > > has a
> > > > > > high
> > > > > > > > > > > chance of
> > > > > > > > > > > > > > > requiring to bring an entire 64k block in. You'll
> > > > see bad
> > > > > > > > > > > performance
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > > > user@hbase.apache.org>
> > > > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > > > > > logging. No
> > > > > > > > > > long
> > > > > > > > > > > pauses
> > > > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Other than the discrepancy between what's
> > > > reported on
> > > > > > the
> > > > > > > > > > > client and
> > > > > > > > > > > > > > > what's reported at the RS, there is also the
> > issue
> > > > of not
> > > > > > > > getting
> > > > > > > > > > > proper
> > > > > > > > > > > > > > > concurrency. So, even if a reverse get takes
> > 100ms
> > > > or so
> > > > > > > > (this
> > > > > > > > > > has
> > > > > > > > > > > to be
> > > > > > > > > > > > > > > mostly blocking on various things as no physical
> > > > > > resource is
> > > > > > > > > > > contended),
> > > > > > > > > > > > > > > then the other gets/scans should be able to
> > proceed
> > > > in
> > > > > > > > parallel,
> > > > > > > > > > > so a
> > > > > > > > > > > > > > > thousand concurrent gets/scans should finish in
> > few
> > > > > > hundreds
> > > > > > > > of
> > > > > > > > > > ms
> > > > > > > > > > > not
> > > > > > > > > > > > > > many
> > > > > > > > > > > > > > > seconds. That's why I thought I'd increase the
> > > > handlers
> > > > > > > > count to
> > > > > > > > > > > try to
> > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > more concurrency, but it didn't help. So, there
> > must
> > > > be
> > > > > > > > something
> > > > > > > > > > > else.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Do check again on the heap size of the region
> > > > > > servers.
> > > > > > > > The
> > > > > > > > > > > default
> > > > > > > > > > > > > > > >>> unconfigured size is 1G; too small for much
> > of
> > > > > > anything.
> > > > > > > > > > Check
> > > > > > > > > > > your
> > > > > > > > > > > > > > RS
> > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > >>> -- look for lines produced by the
> > JVMPauseMonitor
> > > > > > thread.
> > > > > > > > > > They
> > > > > > > > > > > > > > usually
> > > > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > > > process-freeze
> > > > > > > > events.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Get is implemented as a Scan of a single
> > row, so
> > > > a
> > > > > > > > reverse
> > > > > > > > > > > scan of a
> > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> In practice, I have seen discrepancy between
> > the
> > > > > > > > latencies
> > > > > > > > > > > reported
> > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >>> RS and the latencies experienced by the
> > client.
> > > > I've
> > > > > > not
> > > > > > > > > > > investigated
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > > > Elmeleegy <
> > > > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Yes performance is similar with less
> > handlers (I
> > > > > > tried
> > > > > > > > with
> > > > > > > > > > > 100
> > > > > > > > > > > > > > > first).
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The
> > working
> > > > set
> > > > > > > > doesn't
> > > > > > > > > > > seem to
> > > > > > > > > > > > > > > fit in
> > > > > > > > > > > > > > > >>>> memory as there are many cache misses.
> > However,
> > > > > > disk is
> > > > > > > > far
> > > > > > > > > > > from
> > > > > > > > > > > > > > > being a
> > > > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also
> > > > verified
> > > > > > that
> > > > > > > > > > > neither the
> > > > > > > > > > > > > > > >>>> network nor the CPU of the region server or
> > the
> > > > > > client
> > > > > > > > are a
> > > > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > > > >>>> This leads me to believe that likely this
> > is a
> > > > > > software
> > > > > > > > > > > bottleneck,
> > > > > > > > > > > > > > > >>>> possibly due to a misconfiguration on my
> > side. I
> > > > > > just
> > > > > > > > don't
> > > > > > > > > > > know how
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is the
> > > > individual
> > > > > > > > request
> > > > > > > > > > > latency
> > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > >>>> reported by metrics on the region server
> > (IPC
> > > > > > > > > > processCallTime
> > > > > > > > > > > vs
> > > > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > > > >>>> vs what's measured on the client. Does this
> > > > sound
> > > > > > > > right? Any
> > > > > > > > > > > ideas
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> About this trick with the timestamps to be
> > able
> > > > to
> > > > > > do a
> > > > > > > > > > > forward
> > > > > > > > > > > > > > scan,
> > > > > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am
> > > > aware of
> > > > > > it.
> > > > > > > > The
> > > > > > > > > > > problem
> > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > > > > > particular
> > > > > > > > > > > timestamp and
> > > > > > > > > > > > > > > >>>> sometimes I want to get the key before, so
> > just
> > > > > > relying
> > > > > > > > on
> > > > > > > > > > > the key
> > > > > > > > > > > > > > > order
> > > > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse
> > get(). I
> > > > > > thought
> > > > > > > > > > > reverse
> > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Khaled
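
Both directions can still be expressed with the one scan API; a sketch (class and method names made up), where the returned Scan is handed to getScanner() and only the first Result is kept:

    import org.apache.hadoop.hbase.client.Scan;

    public class BoundaryScans {
      // first row at or after the boundary key: ordinary forward scan
      public static Scan firstAtOrAfter(byte[] boundary) {
        Scan s = new Scan(boundary);
        s.setSmall(true);
        s.setCaching(1);
        return s;
      }

      // last row at or before the boundary key: reversed scan starting at the boundary
      public static Scan lastAtOrBefore(byte[] boundary) {
        Scan s = new Scan(boundary);
        s.setReversed(true);
        s.setSmall(true);
        s.setCaching(1);
        return s;
      }
    }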
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500
> > handler
> > > > > > threads
> > > > > > > > > > > seems to
> > > > > > > > > > > > > > be a
> > > > > > > > > > > > > > > >>>> lot. Typically a good number of threads
> > depends
> > > > on
> > > > > > the
> > > > > > > > > > > hardware
> > > > > > > > > > > > > > > (number of
> > > > > > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot
> > think
> > > > of
> > > > > > any
> > > > > > > > type
> > > > > > > > > > of
> > > > > > > > > > > > > > > scenario
> > > > > > > > > > > > > > > >>>> where more than 100 would give any
> > improvement.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> How large is the payload per KV retrieved
> > that
> > > > > > way? If
> > > > > > > > > > large
> > > > > > > > > > > (as
> > > > > > > > > > > > > > in a
> > > > > > > > > > > > > > > >>>> few 100k) you definitely want to lower the
> > > > number
> > > > > > of the
> > > > > > > > > > > handler
> > > > > > > > > > > > > > > threads.
> > > > > > > > > > > > > > > >>>>> How much heap do you give the region
> > server?
> > > > Does
> > > > > > the
> > > > > > > > > > > working set
> > > > > > > > > > > > > > fit
> > > > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you
> > > > see the
> > > > > > > > > > eviction
> > > > > > > > > > > count
> > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> If the working set does not fit into the
> > cache
> > > > > > > > (eviction
> > > > > > > > > > > count goes
> > > > > > > > > > > > > > > up)
> > > > > > > > > > > > > > > >>>> then HBase will need to bring a new block in
> > > > from
> > > > > > disk
> > > > > > > > on
> > > > > > > > > > > each Get
> > > > > > > > > > > > > > > >>>> (assuming the Gets are more or less random
> > as
> > > > far
> > > > > > as the
> > > > > > > > > > > server is
> > > > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > > > >>>>> In case you'll benefit from reducing the
> > HFile
> > > > > > block
> > > > > > > > size
> > > > > > > > > > > (from 64k
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Lastly I don't think we tested the
> > performance
> > > > of
> > > > > > using
> > > > > > > > > > > reverse
> > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > >>>> this way, there is probably room to optimize
> > > > this.
> > > > > > > > > > > > > > > >>>>> Can you restructure your keys to allow
> > forwards
> > > > > > > > scanning?
> > > > > > > > > > For
> > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > >>>> you could store the time as MAX_LONG-time.
> > Or
> > > > you
> > > > > > could
> > > > > > > > > > > invert all
> > > > > > > > > > > > > > > the bits
> > > > > > > > > > > > > > > >>>> of the time portion of the key, so that it
> > sort
> > > > the
> > > > > > > > other
> > > > > > > > > > > way. Then
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> -- Lars
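
For the key-inversion idea above, the usual construction is to put Long.MAX_VALUE minus the timestamp into the row key so the newest entries sort first and a plain forward scan suffices. A sketch (the fixed-width id prefix and class name are made up):

    import org.apache.hadoop.hbase.util.Bytes;

    public class NewestFirstKeys {
      // Row key = fixed-width id + inverted timestamp; a forward Scan starting at
      // rowKey(id, boundaryTs) then yields the newest entry at or before boundaryTs first.
      public static byte[] rowKey(byte[] id, long timestampMillis) {
        return Bytes.add(id, Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
      }
    }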
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > >    - Andy
> > > > > > >
> > > > > > > Problems worthy of attack prove their worth by hitting back. -
> > Piet
> > > > Hein
> > > > > > > (via Tom White)
> > > > > >
> > > > > >
> > > >
> > > >
> >
> >

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
I went through a lengthy testing process with Ted. I tried disabling just cache-on-write (COW), and also disabling both COW and the bucket cache (BC). Disabling them makes the problem go away. I think that with COW disabled my workload effectively doesn't exercise the block cache: the workload is mostly writes, so without COW the cached blocks don't churn.
I also tried both overcommitting memory and not overcommitting it; that turned out to be irrelevant. The cache operates on the virtual memory offered to it and is agnostic to the physical memory size. Overcommitting memory may hurt performance through OS swapping, but it has nothing to do with the bug.
Khaled
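
For anyone re-running the same A/B comparison, the two region-server configurations being contrasted above boil down to the listing below. The property names and values are the ones quoted elsewhere in this thread; this is only a sketch of the experiment, not a recommended production config.

    Case 1, bucket cache plus cache-on-write (reproduces the exception):
    hbase-site:
      hbase.bucketcache.ioengine=offheap
      hbase.bucketcache.size=4196
      hbase.rs.cacheblocksonwrite=true
      hfile.block.index.cacheonwrite=true
      hfile.block.bloom.cacheonwrite=true

    Case 2, same bucket cache with cache-on-write disabled (problem goes away):
    hbase-site:
      hbase.bucketcache.ioengine=offheap
      hbase.bucketcache.size=4196
      hbase.rs.cacheblocksonwrite=false
      hfile.block.index.cacheonwrite=false
      hfile.block.bloom.cacheonwrite=false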

> Date: Sun, 12 Oct 2014 23:17:24 -0700
> Subject: Re: HBase read performance
> From: stack@duboce.net
> To: user@hbase.apache.org
> 
> On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> 
> > It goes away, but I think that's because I am basically churning the cache
> > far less. My workload is mostly writes.
> >
> >
> What exactly did you change?
> 
> Did you disable both BC and cacheonwrite or just one of these configs?
> 
> Disabling BC and/or cacheonwrite would churn your cache more?
> 
> What about the heap config?  Did you over commit mem?  If so, did you try
> doing a direct memory sizing that was w/i the bounds of physical memory
> (well within)?  Did you still see exceptions?
> 
> St.Ack
> 
> 
> 
> 
> > > Date: Sun, 12 Oct 2014 19:43:16 -0700
> > > Subject: Re: HBase read performance
> > > From: yuzhihong@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi,
> > > Can you turn off cacheonwrite and keep bucket cache to see if the problem
> > > goes away ?
> > >
> > > Cheers
> > >
> > > On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
> > > wrote:
> > >
> > > > Yes, I can reproduce it with some work.
> > > > The workload is basically as follows:
> > > > There are writers streaming writes to a table. Then, there is a reader
> > > > (invoked via a web interface). The reader does a 1000 parallel reverse
> > > > scans, all end up hitting the same region in my case. The scans are
> > > > effectively "gets" as I just need to get one record off of each of
> > them. I
> > > > just need to do a "reverse" get, which is not supported (would be
> > great to
> > > > have :)), so I do it via reverse scan. After few tries, the reader
> > > > consistently hits this bug.
> > > >
> > > > This happens with these config changes:
> > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > hbase-site:
> > > >   hbase.bucketcache.ioengine=offheap
> > > >   hbase.bucketcache.size=4196
> > > >   hbase.rs.cacheblocksonwrite=true
> > > >   hfile.block.index.cacheonwrite=true
> > > >   hfile.block.bloom.cacheonwrite=true
> > > >
> > > > Interestingly, without these config changes, I can't reproduce the
> > problem.
> > > > Khaled
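
For reference, the "reverse get" access pattern Khaled describes above can be sketched against the 0.98 client API roughly as follows. The table name "delta" comes from earlier in the thread; the row key and the helper class are made-up names for illustration only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseGetSketch {
        // Returns the first row at or before startRow, walking the table backwards.
        static Result reverseGet(HTable table, byte[] startRow) throws Exception {
            Scan scan = new Scan(startRow);   // scan starts here and moves to smaller keys
            scan.setReversed(true);           // reverse scan, available since 0.98
            scan.setCaching(1);               // only one row is ever read per scanner
            scan.setMaxVersions(1);
            ResultScanner scanner = table.getScanner(scan);
            try {
                return scanner.next();        // the single row we wanted, or null
            } finally {
                scanner.close();
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "delta");   // table name taken from this thread
            Result r = reverseGet(table, Bytes.toBytes("some-row-key"));
            System.out.println(r == null ? "no row" : Bytes.toString(r.getRow()));
            table.close();
        }
    }

In the scenario above, roughly a thousand of these single-row reverse scans run in parallel from one client process.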
> > > >
> > > >
> > > > > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > > > > Subject: Re: HBase read performance
> > > > > From: stack@duboce.net
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > It looks like we are messing up our positioning math.
> > > > >
> > > > > The check that throws is java.nio.Buffer#position (openjdk 6-b14, lines 233-235,
> > > > > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233):
> > > > >
> > > > >     public final Buffer position(int newPosition) {
> > > > >         if ((newPosition > limit) || (newPosition < 0))
> > > > >             throw new IllegalArgumentException();
> > > > > Is it easy to reproduce Khaled?  Always same region/store or spread
> > > > > across all reads?
> > > > >
> > > > >
> > > > > St.Ack
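
As a tiny self-contained illustration of the failure Stack is pointing at (this is not HBase code, just the java.nio behaviour): ByteBufferUtils.skip() advances the buffer by a decoded key/value length, and Buffer.position() rejects any value past the limit. The length value here is made up.

    import java.nio.ByteBuffer;

    public class PositionPastLimit {
        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocate(16);
            buf.limit(8);
            int bogusKeyValueLen = 32;   // stands in for whatever length skip() computed
            // position(0 + 32) > limit(8), so this throws IllegalArgumentException,
            // the same check behind the region server stack trace quoted further down.
            buf.position(buf.position() + bogusKeyValueLen);
        }
    }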
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> > >
> > > > wrote:
> > > > >
> > > > > > Andrew thanks. I think this indeed was the problem.
> > > > > > To get over it, I increased the amount of memory given the region
> > > > server
> > > > > > to avoid IO on reads. I used the configs below. In my experiments,
> > I
> > > > have
> > > > > > writers streaming their output to HBase. The reader powers a web
> > page
> > > > and
> > > > > > does this scatter/gather, where it reads 1000 keys written last and
> > > > passes
> > > > > > them the the front end. With this workload, I get the exception
> > below
> > > > at
> > > > > > the region server. Again, I am using HBAse (0.98.6.1). Any help is
> > > > > > appreciated.
> > > > > > hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > > > >   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > > > >   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > > hbase-site:
> > > > > >   hbase.bucketcache.ioengine=offheap
> > > > > >   hfile.block.cache.size=0.4
> > > > > >   hbase.bucketcache.size=4196
> > > > > >   hbase.rs.cacheblocksonwrite=true
> > > > > >   hfile.block.index.cacheonwrite=true
> > > > > >   hfile.block.bloom.cacheonwrite=true
> > > > > >
> > > > > > 2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> > > > > > java.lang.IllegalArgumentException
> > > > > >         at java.nio.Buffer.position(Buffer.java:236)
> > > > > >         at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > > > > >         at java.lang.Thread.run(Thread.java:744)
> > > > > > > From: apurtell@apache.org
> > > > > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > > > > Subject: Re: HBase read performance
> > > > > > > To: user@hbase.apache.org
> > > > > > >
> > > > > > > > The cluster has 2 m1.large nodes.
> > > > > > >
> > > > > > > That's the problem right there.
> > > > > > >
> > > > > > > You need to look at c3.4xlarge or i2 instances as a minimum
> > > > requirement.
> > > > > > M1
> > > > > > > and even M3 instance types have ridiculously poor IO.
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
> > kdiaa@hotmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Nicolas, Qiang.
> > > > > > > >
> > > > > > > > I was able to write a simple program that reproduces the
> > problem
> > > > on a
> > > > > > tiny
> > > > > > > > HBase cluster on ec2. The cluster has 2 m1.large nodes. One
> > node
> > > > runs
> > > > > > the
> > > > > > > > master, name node and zookeeper. The other node runs a data
> > node
> > > > and a
> > > > > > > > region server, with heap size configured to be 6GB. There, the
> > 1000
> > > > > > > > parallel reverse gets (reverse scans) take 7-8 seconds. The
> > data
> > > > set is
> > > > > > > > tiny (10M records, each having a small number of bytes). As I
> > said
> > > > > > before,
> > > > > > > > all hardware resources are very idle there.
> > > > > > > >
> > > > > > > > Interestingly, running the same workload on my macbook, the
> > 1000
> > > > > > parallel
> > > > > > > > gets take ~200ms on a pseudo-distributed installation.
> > > > > > > >
> > > > > > > > Any help to resolve this mystery is highly appreciated.
> > > > > > > >
> > > > > > > > P.S. please find my test program attached.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Khaled
> > > > > > > >
> > > > > > > > > From: nkeywal@gmail.com
> > > > > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I haven't seen it mentioned, but if I understand correctly
> > each
> > > > scan
> > > > > > > > > returns a single row? If so you should use Scan#setSmall to
> > save
> > > > > > some rpc
> > > > > > > > > calls.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Nicolas
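
A hedged sketch of the small-scan hint Nicolas mentions: Scan#setSmall is in the 0.98 API, but whether it combines with setReversed on 0.98 is not verified here, so the sketch shows a plain forward single-row scan. The class and method names are illustrative.

    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class SmallScanSketch {
        // Fetch a single row with the "small scan" hint so the client can avoid the
        // separate openScanner/next/close round trips.
        static Result readOneRow(HTableInterface table, byte[] startRow, byte[] stopRow)
                throws Exception {
            Scan scan = new Scan(startRow, stopRow);
            scan.setSmall(true);
            scan.setCaching(1);
            ResultScanner scanner = table.getScanner(scan);
            try {
                return scanner.next();
            } finally {
                scanner.close();
            }
        }
    }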
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
> > tianq01@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > when using separate HConnection instance, both its
> > > > > > > > > > RpcClient instance(maintain connection to a regionserver)
> > and
> > > > > > Registry
> > > > > > > > > > instance(maintain connection to zookeeper) will be
> > separate..
> > > > > > > > > >
> > > > > > > > > > see
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > > > > kdiaa@hotmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I tried creating my own HConnections pool to use for my
> > HBase
> > > > > > calls,
> > > > > > > > so
> > > > > > > > > > > that not all the (2K) threads share the same HConnection.
> > > > > > However, I
> > > > > > > > > > could
> > > > > > > > > > > only have 10 HConnections. Beyond that I get ZK
> > exceptions,
> > > > > > please
> > > > > > > > find
> > > > > > > > > > it
> > > > > > > > > > > below. Also, with 10 HConnections, I don't see noticeable
> > > > > > > > improvement in
> > > > > > > > > > > performance so far.
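
The connection pool being described could look roughly like the sketch below (the class name and pool size are illustrative; HConnectionManager.createConnection is the 0.98-era API). Each HConnection carries its own RPC client and its own ZooKeeper session, which is also why a large pool can run into the ZooKeeper errors pasted next.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;

    public class HConnectionPool {
        private final List<HConnection> connections = new ArrayList<HConnection>();
        private final AtomicInteger next = new AtomicInteger();

        // Each HConnection has its own RpcClient instance and its own ZooKeeper session.
        public HConnectionPool(Configuration conf, int size) throws Exception {
            for (int i = 0; i < size; i++) {
                connections.add(HConnectionManager.createConnection(conf));
            }
        }

        // Hand out tables round-robin so reader threads spread across connections.
        public HTableInterface getTable(String tableName) throws Exception {
            int i = (next.getAndIncrement() & 0x7fffffff) % connections.size();
            return connections.get(i).getTable(tableName);
        }

        public void close() throws Exception {
            for (HConnection c : connections) {
                c.close();
            }
        }
    }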
> > > > > > > > > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > 2014-10-05 06:11:26,490 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
> > > > > > > > > > > 2014-10-05 06:11:27,845 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > 2014-10-05 06:11:27,849 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
> > > > > > > > > > > 2014-10-05 06:11:30,405 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > 2014-10-05 06:11:30,405 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
> > > > > > > > > > > 2014-10-05 06:11:35,278 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > 2014-10-05 06:11:35,279 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
> > > > > > > > > > > 2014-10-05 06:11:44,393 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > > 2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
> > > > > > > > > > > 2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> > > > > > > > > > > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > > > >         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > > > >         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > > > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > > > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
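
One hedged guess at the ConnectionLoss above: every extra HConnection opens another ZooKeeper session from the same client host, and a low per-client connection cap on the ZooKeeper server would cut things off after a handful of connections, which is what the link Qiang posted is about. If HBase manages the quorum, the cap is exposed as a pass-through property; for an external ZooKeeper it is maxClientCnxns in zoo.cfg. The value below is only an example, not something verified on this cluster.

    hbase-site (on the nodes running the HBase-managed ZooKeeper):
      hbase.zookeeper.property.maxClientCnxns=300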
> > > > > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > > > > >
> > > > > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > > > > Thanks for all the input.
> > > > > > > > > > > > Qiang: yes all the threads are in the same client
> > process
> > > > > > sharing
> > > > > > > > the
> > > > > > > > > > > same connection. And since I don't see hardware
> > contention,
> > > > may
> > > > > > be
> > > > > > > > there
> > > > > > > > > > is
> > > > > > > > > > > contention over this code path. I'll try using many
> > > > connections
> > > > > > and
> > > > > > > > see
> > > > > > > > > > if
> > > > > > > > > > > it alleviates the problems and I'll report back.
> > > > > > > > > > > > Thanks again,Khaled
> > > > > > > > > > > >
> > > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > > > > >
> > > > > > > >
> > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > > > > months
> > > > > > > > > > > > > ago.
> > > > > > > > > > > > >
> > > > > > > > > > > > > processCallTime comes from RpcServer#call, so it
> > looks
> > > > good?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have a suspect:
> > > > > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > > > > >
> > > > > > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > > > > > if olny 1 process, those threads will share just 1
> > > > > > connection to
> > > > > > > > that
> > > > > > > > > > > > > regionserver, there might be big contention on the
> > RPC
> > > > code
> > > > > > path.
> > > > > > > > > > > ---for
> > > > > > > > > > > > > such case, could you try using different connections?
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> > > > yuzhihong@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Khaled:
> > > > > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Otherwise capturing stack trace during the period
> > of
> > > > > > reverse
> > > > > > > > scan
> > > > > > > > > > > would
> > > > > > > > > > > > > > help.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > > > > > larsh@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You might have the data in the OS buffer cache,
> > > > without
> > > > > > short
> > > > > > > > > > > circuit
> > > > > > > > > > > > > > > reading the region server has to request the
> > block
> > > > from
> > > > > > the
> > > > > > > > data
> > > > > > > > > > > node
> > > > > > > > > > > > > > > process, which then reads it from the block
> > cache.
> > > > > > > > > > > > > > > That is a few context switches per RPC that do
> > not
> > > > show
> > > > > > up
> > > > > > > > in CPU
> > > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If - as you say - you see a lot of evicted
> > blocks the
> > > > > > data
> > > > > > > > *has*
> > > > > > > > > > > to come
> > > > > > > > > > > > > > > from the OS. If you do not see disk IO is *has*
> > to
> > > > come
> > > > > > from
> > > > > > > > the
> > > > > > > > > > OS
> > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > I.e. there's more RAM on your boxes, and you
> > should
> > > > > > increase
> > > > > > > > the
> > > > > > > > > > > heap
> > > > > > > > > > > > > > block
> > > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You can measure the context switches with vmstat.
> > > > Other
> > > > > > than
> > > > > > > > that
> > > > > > > > > > > I have
> > > > > > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > > > > > Also check the data locality index of the region
> > > > server
> > > > > > it
> > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > > > close
> > > > > > > > > > > > > > > to 100%.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -- Lars
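
For the short-circuit read point above, the usual HDFS-side switches look roughly like the listing below: hdfs-site.xml on the DataNodes and on the region server's client configuration. The socket path is just the commonly used example; adjust it for the install.

    hdfs-site:
      dfs.client.read.shortcircuit=true
      dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket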
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > user@hbase.apache.org>
> > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll make
> > sure
> > > > I
> > > > > > cover
> > > > > > > > all
> > > > > > > > > > > of them
> > > > > > > > > > > > > > > and get back to you. I am not sure they are the
> > > > > > bottleneck
> > > > > > > > though
> > > > > > > > > > > as they
> > > > > > > > > > > > > > > all are about optimizing physical resource
> > usage. As
> > > > I
> > > > > > said,
> > > > > > > > I
> > > > > > > > > > > don't see
> > > > > > > > > > > > > > > any contended physical resources now. I'll also
> > try
> > > > to
> > > > > > > > reproduce
> > > > > > > > > > > this
> > > > > > > > > > > > > > > problem in a simpler environment and pass to you
> > the
> > > > test
> > > > > > > > program
> > > > > > > > > > > to play
> > > > > > > > > > > > > > > with.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Couple of high level points to make. You are
> > right
> > > > that
> > > > > > my
> > > > > > > > use
> > > > > > > > > > > case is
> > > > > > > > > > > > > > > kind of a worst case for HBase reads. But, if
> > things
> > > > go
> > > > > > the
> > > > > > > > way
> > > > > > > > > > you
> > > > > > > > > > > > > > > described them, there should be tons of disk IO
> > and
> > > > that
> > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > > > > > clearly
> > > > > > > > > > > > > > > the bottleneck. This is not the case though.
> > That's
> > > > for
> > > > > > the
> > > > > > > > > > simple
> > > > > > > > > > > reason
> > > > > > > > > > > > > > > that this is done in a test environment (I am
> > still
> > > > > > > > prototyping),
> > > > > > > > > > > and
> > > > > > > > > > > > > > not a
> > > > > > > > > > > > > > > lot of data is yet written to HBase. However for
> > the
> > > > > > real use
> > > > > > > > > > > case, there
> > > > > > > > > > > > > > > should writers constantly writing data to HBase
> > and
> > > > > > readers
> > > > > > > > > > > occasionally
> > > > > > > > > > > > > > > doing this scatter/gather. At steady state,
> > things
> > > > should
> > > > > > > > only
> > > > > > > > > > get
> > > > > > > > > > > worse
> > > > > > > > > > > > > > > and all the issues you mentioned should get far
> > more
> > > > > > > > pronounced.
> > > > > > > > > > > At this
> > > > > > > > > > > > > > > point, one can try to mitigate it using more
> > memory
> > > > or
> > > > > > so. I
> > > > > > > > am
> > > > > > > > > > > not there
> > > > > > > > > > > > > > > yet as I think I am hitting some software
> > bottleneck,
> > > > > > which I
> > > > > > > > > > > don't know
> > > > > > > > > > > > > > > how to work around.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > > > > Any chance that you can provide a minimal test
> > > > program
> > > > > > and
> > > > > > > > > > > instruction
> > > > > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > One thing to note is that with scanning HBase
> > > > cannot
> > > > > > use
> > > > > > > > bloom
> > > > > > > > > > > filters
> > > > > > > > > > > > > > > to rule out HFiles ahead of time, it needs to
> > look
> > > > into
> > > > > > all
> > > > > > > > of
> > > > > > > > > > > them.
> > > > > > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > > > > > - random reads that do not fit into the block
> > cache
> > > > > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > > > > > - Do you have short circuit reading enabled? If
> > > > not,
> > > > > > you
> > > > > > > > > > should.
> > > > > > > > > > > > > > > > - Is your table major compacted? That will
> > reduce
> > > > the
> > > > > > > > number of
> > > > > > > > > > > files
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > look at.
> > > > > > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> > > > > > tcpnodelay)?
> > > > > > > > It
> > > > > > > > > > > disabled
> > > > > > > > > > > > > > > by default in HBase, but necessarily in your
> > install
> > > > of
> > > > > > HDFS.
> > > > > > > > > > > > > > > > - Which version of HDFS are you using as
> > backing
> > > > > > > > filesystem?
> > > > > > > > > > > > > > > > - If your disk is idle, it means the data fits
> > into
> > > > > > the OS
> > > > > > > > > > buffer
> > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > > In turn that means that you increase the heap
> > for the
> > > > > > region
> > > > > > > > > > > servers. You
> > > > > > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to
> > > > make
> > > > > > sure
> > > > > > > > the
> > > > > > > > > > > entire
> > > > > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - Also try to reduce the block size - although
> > if
> > > > your
> > > > > > > > overall
> > > > > > > > > > > working
> > > > > > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This is a good section of the book to read
> > through
> > > > > > > > generally
> > > > > > > > > > > (even
> > > > > > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > > > > >
> > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -- Lars
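
A sketch of applying the FAST_DIFF encoding and smaller-block suggestions from this checklist with the 0.98 admin API. The column family name "d" is an assumption for illustration; existing HFiles keep their old settings until they are rewritten, hence the major compaction at the end.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TuneDeltaFamily {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                TableName tn = TableName.valueOf("delta");
                HTableDescriptor desc = admin.getTableDescriptor(tn);
                // "d" is an assumed family name for this sketch.
                HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d"));
                family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
                family.setBlocksize(8 * 1024);   // 8k blocks instead of the 64k default
                admin.disableTable(tn);
                admin.modifyColumn(tn, family);
                admin.enableTable(tn);
                admin.majorCompact("delta");     // rewrite existing HFiles with the new settings
            } finally {
                admin.close();
            }
        }
    }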
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> > user@hbase.apache.org
> > > > >
> > > > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I do see a very brief spike in CPU
> > (user/system),
> > > > but
> > > > > > it's
> > > > > > > > no
> > > > > > > > > > > where
> > > > > > > > > > > > > > near
> > > > > > > > > > > > > > > 0% idle. It goes from 99% idle down to something
> > > > like 40%
> > > > > > > > idle
> > > > > > > > > > for
> > > > > > > > > > > a
> > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > or so. The thing to note, this is all on a test
> > > > cluster,
> > > > > > so
> > > > > > > > no
> > > > > > > > > > > real load.
> > > > > > > > > > > > > > > Things are generally idle until i issue 2-3 of
> > these
> > > > > > > > > > > multi-scan-requests
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > render a web page. Then, you see the spike in the
> > > > cpu and
> > > > > > > > some
> > > > > > > > > > > activity
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the network and disk, but nowhere near
> > saturation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If there are specific tests you'd like me to
> > do to
> > > > > > debug
> > > > > > > > this,
> > > > > > > > > > > I'd be
> > > > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> I still think you're waiting on disk. No
> > IOWAIT?
> > > > So
> > > > > > CPU
> > > > > > > > is not
> > > > > > > > > > > waiting
> > > > > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> If you see a lot of evicted block then each
> > RPC
> > > > has a
> > > > > > high
> > > > > > > > > > > chance of
> > > > > > > > > > > > > > > requiring to bring an entire 64k block in. You'll
> > > > see bad
> > > > > > > > > > > performance
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > > > user@hbase.apache.org>
> > > > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > > > > > logging. No
> > > > > > > > > > long
> > > > > > > > > > > pauses
> > > > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Other than the discrepancy between what's
> > > > reported on
> > > > > > the
> > > > > > > > > > > client and
> > > > > > > > > > > > > > > what's reported at the RS, there is also the
> > issue
> > > > of not
> > > > > > > > getting
> > > > > > > > > > > proper
> > > > > > > > > > > > > > > concurrency. So, even if a reverse get takes
> > 100ms
> > > > or so
> > > > > > > > (this
> > > > > > > > > > has
> > > > > > > > > > > to be
> > > > > > > > > > > > > > > mostly blocking on various things as no physical
> > > > > > resource is
> > > > > > > > > > > contended),
> > > > > > > > > > > > > > > then the other gets/scans should be able to
> > proceed
> > > > in
> > > > > > > > parallel,
> > > > > > > > > > > so a
> > > > > > > > > > > > > > > thousand concurrent gets/scans should finish in
> > few
> > > > > > hundreds
> > > > > > > > of
> > > > > > > > > > ms
> > > > > > > > > > > not
> > > > > > > > > > > > > > many
> > > > > > > > > > > > > > > seconds. That's why I thought I'd increase the
> > > > handlers
> > > > > > > > count to
> > > > > > > > > > > try to
> > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > more concurrency, but it didn't help. So, there
> > must
> > > > be
> > > > > > > > something
> > > > > > > > > > > else.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Do check again on the heap size of the region
> > > > > > servers.
> > > > > > > > The
> > > > > > > > > > > default
> > > > > > > > > > > > > > > >>> unconfigured size is 1G; too small for much
> > of
> > > > > > anything.
> > > > > > > > > > Check
> > > > > > > > > > > your
> > > > > > > > > > > > > > RS
> > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > >>> -- look for lines produced by the
> > JVMPauseMonitor
> > > > > > thread.
> > > > > > > > > > They
> > > > > > > > > > > > > > usually
> > > > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > > > process-freeze
> > > > > > > > events.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Get is implemented as a Scan of a single
> > row, so
> > > > a
> > > > > > > > reverse
> > > > > > > > > > > scan of a
> > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> In practice, I have seen discrepancy between
> > the
> > > > > > > > latencies
> > > > > > > > > > > reported
> > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >>> RS and the latencies experienced by the
> > client.
> > > > I've
> > > > > > not
> > > > > > > > > > > investigated
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > > > Elmeleegy <
> > > > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Yes performance is similar with less
> > handlers (I
> > > > > > tried
> > > > > > > > with
> > > > > > > > > > > 100
> > > > > > > > > > > > > > > first).
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The
> > working
> > > > set
> > > > > > > > doesn't
> > > > > > > > > > > seem to
> > > > > > > > > > > > > > > fit in
> > > > > > > > > > > > > > > >>>> memory as there are many cache misses.
> > However,
> > > > > > disk is
> > > > > > > > far
> > > > > > > > > > > from
> > > > > > > > > > > > > > > being a
> > > > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also
> > > > verified
> > > > > > that
> > > > > > > > > > > neither the
> > > > > > > > > > > > > > > >>>> network nor the CPU of the region server or
> > the
> > > > > > client
> > > > > > > > are a
> > > > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > > > >>>> This leads me to believe that likely this
> > is a
> > > > > > software
> > > > > > > > > > > bottleneck,
> > > > > > > > > > > > > > > >>>> possibly due to a misconfiguration on my
> > side. I
> > > > > > just
> > > > > > > > don't
> > > > > > > > > > > know how
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is the
> > > > individual
> > > > > > > > request
> > > > > > > > > > > latency
> > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > >>>> reported by metrics on the region server
> > (IPC
> > > > > > > > > > processCallTime
> > > > > > > > > > > vs
> > > > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > > > >>>> vs what's measured on the client. Does this
> > > > sound
> > > > > > > > right? Any
> > > > > > > > > > > ideas
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> About this trick with the timestamps to be
> > able
> > > > to
> > > > > > do a
> > > > > > > > > > > forward
> > > > > > > > > > > > > > scan,
> > > > > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am
> > > > aware of
> > > > > > it.
> > > > > > > > The
> > > > > > > > > > > problem
> > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > > > > > particular
> > > > > > > > > > > timestamp and
> > > > > > > > > > > > > > > >>>> sometimes I want to get the key before, so
> > just
> > > > > > relying
> > > > > > > > on
> > > > > > > > > > > the key
> > > > > > > > > > > > > > > order
> > > > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse
> > get(). I
> > > > > > thought
> > > > > > > > > > > reverse
> > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500
> > handler
> > > > > > threads
> > > > > > > > > > > seems to
> > > > > > > > > > > > > > be a
> > > > > > > > > > > > > > > >>>> lot. Typically a good number of threads
> > depends
> > > > on
> > > > > > the
> > > > > > > > > > > hardware
> > > > > > > > > > > > > > > (number of
> > > > > > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot
> > think
> > > > of
> > > > > > any
> > > > > > > > type
> > > > > > > > > > of
> > > > > > > > > > > > > > > scenario
> > > > > > > > > > > > > > > >>>> where more than 100 would give any
> > improvement.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> How large is the payload per KV retrieved
> > that
> > > > > > way? If
> > > > > > > > > > large
> > > > > > > > > > > (as
> > > > > > > > > > > > > > in a
> > > > > > > > > > > > > > > >>>> few 100k) you definitely want to lower the
> > > > number
> > > > > > of the
> > > > > > > > > > > handler
> > > > > > > > > > > > > > > threads.
> > > > > > > > > > > > > > > >>>>> How much heap do you give the region
> > server?
> > > > Does
> > > > > > the
> > > > > > > > > > > working set
> > > > > > > > > > > > > > fit
> > > > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you
> > > > see the
> > > > > > > > > > eviction
> > > > > > > > > > > count
> > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> If the working set does not fit into the
> > cache
> > > > > > > > (eviction
> > > > > > > > > > > count goes
> > > > > > > > > > > > > > > up)
> > > > > > > > > > > > > > > >>>> then HBase will need to bring a new block in
> > > > from
> > > > > > disk
> > > > > > > > on
> > > > > > > > > > > each Get
> > > > > > > > > > > > > > > >>>> (assuming the Gets are more or less random
> > as
> > > > far
> > > > > > as the
> > > > > > > > > > > server is
> > > > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > > > >>>>> In case you'll benefit from reducing the
> > HFile
> > > > > > block
> > > > > > > > size
> > > > > > > > > > > (from 64k
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Lastly I don't think we tested the
> > performance
> > > > of
> > > > > > using
> > > > > > > > > > > reverse
> > > > > > > > > > > > > > scan
> > > > > > > > > > > > > > > >>>> this way, there is probably room to optimize
> > > > this.
> > > > > > > > > > > > > > > >>>>> Can you restructure your keys to allow
> > forwards
> > > > > > > > scanning?
> > > > > > > > > > For
> > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > >>>> you could store the time as MAX_LONG-time.
> > Or
> > > > you
> > > > > > could
> > > > > > > > > > > invert all
> > > > > > > > > > > > > > > the bits
> > > > > > > > > > > > > > > >>>> of the time portion of the key, so that it
> > sort
> > > > the
> > > > > > > > other
> > > > > > > > > > > way. Then
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>>
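
A minimal sketch of the key layout Lars suggests, assuming a row key of the form <id><reversed timestamp>: with Long.MAX_VALUE minus the timestamp, newer entries sort first, so a plain forward scan (or get) starting at the id prefix returns the latest one. The id and timestamps below are made up for illustration.

    import org.apache.hadoop.hbase.util.Bytes;

    public class ReversedTimestampKey {
        // Row key = id bytes + (Long.MAX_VALUE - timestamp). Newer timestamps become
        // smaller numbers, so they sort first and a forward scan from the id prefix
        // hits the latest entry without needing a reverse scan.
        static byte[] rowKey(String id, long timestampMillis) {
            return Bytes.add(Bytes.toBytes(id), Bytes.toBytes(Long.MAX_VALUE - timestampMillis));
        }

        public static void main(String[] args) {
            byte[] newer = rowKey("sensor-1", 2000L);
            byte[] older = rowKey("sensor-1", 1000L);
            // Prints a negative number: the newer key sorts before the older one.
            System.out.println(Bytes.compareTo(newer, older));
        }
    }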
> > > > > > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <
> > > > user@hbase.apache.org
> > > > > > >
> > > > > > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> > > > > > (0.98.6.1),
> > > > > > > > > > > where I
> > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > >>>> client reading ~1000 keys from an HBase
> > table.
> > > > These
> > > > > > > > keys
> > > > > > > > > > > happen to
> > > > > > > > > > > > > > > fall on
> > > > > > > > > > > > > > > >>>> the same region server. For my reads I use
> > > > reverse
> > > > > > scan
> > > > > > > > to
> > > > > > > > > > > read each
> > > > > > > > > > > > > > > key as
> > > > > > > > > > > > > > > >>>> I want the key prior to a specific time
> > stamp
> > > > (time
> > > > > > > > stamps
> > > > > > > > > > are
> > > > > > > > > > > > > > stored
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > >>>> reverse order). I don't believe gets can
> > > > accomplish
> > > > > > > > that,
> > > > > > > > > > > right? so
> > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> I use 2000 reader threads in the client
> > and on
> > > > > > HBase,
> > > > > > > > I've
> > > > > > > > > > > set
> > > > > > > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500.
> > With
> > > > this
> > > > > > > > setup,
> > > > > > > > > > my
> > > > > > > > > > > > > > scatter
> > > > > > > > > > > > > > > >>>> gather is very slow and can take up to 10s
> > in
> > > > total.
> > > > > > > > Timing
> > > > > > > > > > an
> > > > > > > > > > > > > > > individual
> > > > > > > > > > > > > > > >>>> getScanner(..) call on the client side, it
> > can
> > > > > > easily
> > > > > > > > take
> > > > > > > > > > few
> > > > > > > > > > > > > > > hundreds of
> > > > > > > > > > > > > > > >>>> ms. I also got the following metrics from
> > the
> > > > region
> > > > > > > > server
> > > > > > > > > > in
> > > > > > > > > > > > > > > question:
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> > > > > > 556.9799999999818,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > > > > > >>>>> "processCallTime_mean" :
> > 105.64873440912682,
> > > > > > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > > > > > >>>>> "processCallTime_95th_percentile" :
> > 7917.95,
> > > > > > > > > > > > > > > >>>>> "processCallTime_99th_percentile" :
> > 8876.89,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > > > > > > > > >>>> : 89,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > > > > > > > > >>>> : 11300,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > > > > > > > > >>>> : 113.0,
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> > > > > > querying.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> In addition to all this, i monitored the
> > > > hardware
> > > > > > > > resources
> > > > > > > > > > > (CPU,
> > > > > > > > > > > > > > > disk,
> > > > > > > > > > > > > > > >>>> and network) of both the client and the
> > region
> > > > > > server
> > > > > > > > and
> > > > > > > > > > > nothing
> > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by
> > > > what's
> > > > > > > > going on
> > > > > > > > > > > and
> > > > > > > > > > > > > > > where this
> > > > > > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Few things to note based on the above
> > > > measurements:
> > > > > > > > both
> > > > > > > > > > > medians of
> > > > > > > > > > > > > > > IPC
> > > > > > > > > > > > > > > >>>> processCallTime and queueCallTime are
> > basically
> > > > zero
> > > > > > > > (ms I
> > > > > > > > > > > presume,
> > > > > > > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms
> > > > too,
> > > > > > > > right?). I
> > > > > > > > > > > am not
> > > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > >>>> how this adds up. Also, even though the 101
> > > > figure
> > > > > > seems
> > > > > > > > > > > > > > outrageously
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > >>>> and I don't know why, still all these scans
> > > > should
> > > > > > be
> > > > > > > > > > > happening in
> > > > > > > > > > > > > > > >>>> parallel, so the overall call should finish
> > > > fast,
> > > > > > given
> > > > > > > > that
> > > > > > > > > > > no
> > > > > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > >>>> resource is contended, right? but this is
> > not
> > > > what's
> > > > > > > > > > > happening, so I
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > >    - Andy
> > > > > > >
> > > > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> > > > > >
> > > > > >
> > > >
> > > >
> >
> >

Re: HBase read performance

Posted by Stack <st...@duboce.net>.
On Sun, Oct 12, 2014 at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:

> It goes away, but i think that's because i am basically churning the cache
> far less. My workload is mostly writes.
>
>
What exactly did you change?

Did you disable both BC and cacheonwrite or just one of these configs?

Disabling BC and/or cacheonwrite would churn your cache more?

What about the heap config?  Did you over commit mem?  If so, did you try
doing a direct memory sizing that was w/i the bounds of physical memory
(well within)?  Did you still see exceptions?
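For reference, the numbers quoted further down in this thread (a 6 GB heap via -Xmx6G plus -XX:MaxDirectMemorySize=5G backing an hbase.bucketcache.size of 4196 MB) add up to roughly 11 GB of potential JVM footprint, while an m1.large, the instance type mentioned earlier, has only about 7.5 GB of physical memory. If those were the settings in play, that combination alone would over-commit RAM regardless of anything else on the box; the instance size is taken from earlier messages and the arithmetic is only an estimate.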

St.Ack




> > Date: Sun, 12 Oct 2014 19:43:16 -0700
> > Subject: Re: HBase read performance
> > From: yuzhihong@gmail.com
> > To: user@hbase.apache.org
> >
> > Hi,
> > Can you turn off cacheonwrite and keep bucket cache to see if the problem
> > goes away ?
> >
> > Cheers
> >
> > On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
> > wrote:
> >
> > > Yes, I can reproduce it with some work.
> > > The workload is basically as follows:
> > > There are writers streaming writes to a table. Then, there is a reader
> > > (invoked via a web interface). The reader does a 1000 parallel reverse
> > > scans, all end up hitting the same region in my case. The scans are
> > > effectively "gets" as I just need to get one record off of each of
> them. I
> > > just need to do a "reverse" get, which is not supported (would be
> great to
> > > have :)), so I do it via reverse scan. After few tries, the reader
> > > consistently hits this bug.
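A minimal sketch of the "reverse get via reverse scan" pattern described in the paragraph above. The table name ("delta"), the column family, and the key layout (an entity id followed by a timestamp) are assumptions for illustration, not the actual schema from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseGetSketch {
      // Returns the newest row at or before 'ts' for the given entity, or null if none exists.
      static Result reverseGet(HConnection connection, byte[] entityId, long ts)
          throws java.io.IOException {
        Scan scan = new Scan();
        scan.setReversed(true);                               // walk backwards from the start row
        scan.setStartRow(Bytes.add(entityId, Bytes.toBytes(ts)));
        scan.setStopRow(entityId);                            // stay within this entity's rows
        scan.setCaching(1);                                   // only the first row is wanted
        scan.setMaxVersions(1);
        HTableInterface table = connection.getTable("delta");
        ResultScanner scanner = table.getScanner(scan);
        try {
          return scanner.next();
        } finally {
          scanner.close();
          table.close();
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HConnection connection = HConnectionManager.createConnection(conf);
        System.out.println(reverseGet(connection, Bytes.toBytes("some-entity"), System.currentTimeMillis()));
        connection.close();
      }
    }

Each such call is still a separate openScanner/next/close exchange with the region server, which is what the small-scan suggestion elsewhere in this thread is aimed at.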
> > >
> > > This happens with these config changes:
> > > hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > hbase-site: hbase.bucketcache.ioengine=offheap
> > > hbase.bucketcache.size=4196
> > > hbase.rs.cacheblocksonwrite=true
> > > hfile.block.index.cacheonwrite=true
> > > hfile.block.bloom.cacheonwrite=true
> > >
> > > Interestingly, without these config changes, I can't reproduce the
> problem.
> > > Khaled
> > >
> > >
> > > > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > > > Subject: Re: HBase read performance
> > > > From: stack@duboce.net
> > > > To: user@hbase.apache.org
> > > >
> > > > It looks like we are messing up our positioning math:
> > > >
> > > > java.nio.Buffer, lines 233-235 (OpenJDK 6-b14, via
> > > > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233):
> > > >
> > > >     public final Buffer position(int newPosition) {
> > > >         if ((newPosition > limit) || (newPosition < 0))
> > > >             throw new IllegalArgumentException();
> > > >
> > > >
> > > > Is it easy to reproduce Khaled?  Always same region/store or spread
> > > > across all reads?
> > > >
> > > >
> > > > St.Ack
> > > >
> > > >
> > > >
> > > > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> >
> > > wrote:
> > > >
> > > > > Andrew thanks. I think this indeed was the problem.
> > > > > To get over it, I increased the amount of memory given the region
> > > server
> > > > > to avoid IO on reads. I used the configs below. In my experiments,
> I
> > > have
> > > > > writers streaming their output to HBase. The reader powers a web
> page
> > > and
> > > > > does this scatter/gather, where it reads 1000 keys written last and
> > > passes
> > > > > them to the front end. With this workload, I get the exception
> below
> > > at
> > > > > the region server. Again, I am using HBAse (0.98.6.1). Any help is
> > > > > appreciated.
> > > > > hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > > > hbase-site: hbase.bucketcache.ioengine=offheap
> > > > >             hfile.block.cache.size=0.4
> > > > >             hbase.bucketcache.size=4196
> > > > >             hbase.rs.cacheblocksonwrite=true
> > > > >             hfile.block.index.cacheonwrite=true
> > > > >             hfile.block.bloom.cacheonwrite=true
> > > > >
> > > > > 2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
> > > > > java.lang.IllegalArgumentException
> > > > >         at java.nio.Buffer.position(Buffer.java:236)
> > > > >         at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > > > >         at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > > > >         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > > > >         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > > > >         at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > > > >         at java.lang.Thread.run(Thread.java:744)
> > > > > > From: apurtell@apache.org
> > > > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > > > Subject: Re: HBase read performance
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > > The cluster has 2 m1.large nodes.
> > > > > >
> > > > > > That's the problem right there.
> > > > > >
> > > > > > You need to look at c3.4xlarge or i2 instances as a minimum
> > > requirement.
> > > > > M1
> > > > > > and even M3 instance types have ridiculously poor IO.
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Thanks Nicolas, Qiang.
> > > > > > >
> > > > > > > I was able to write a simple program that reproduces the
> problem
> > > on a
> > > > > tiny
> > > > > > > HBase cluster on ec2. The cluster has 2 m1.large nodes. One
> node
> > > runs
> > > > > the
> > > > > > > master, name node and zookeeper. The other node runs a data
> node
> > > and a
> > > > > > > region server, with heap size configured to be 6GB. There, the
> 1000
> > > > > > > parallel reverse gets (reverse scans) take 7-8 seconds. The
> data
> > > set is
> > > > > > > tiny (10M records, each having a small number of bytes). As I
> said
> > > > > before,
> > > > > > > all hardware resources are very idle there.
> > > > > > >
> > > > > > > Interestingly, running the same workload on my macbook, the
> 1000
> > > > > parallel
> > > > > > > gets take ~200ms on a pseudo-distributed installation.
> > > > > > >
> > > > > > > Any help to resolve this mystery is highly appreciated.
> > > > > > >
> > > > > > > P.S. please find my test program attached.
> > > > > > >
> > > > > > > Best,
> > > > > > > Khaled
> > > > > > >
> > > > > > > > From: nkeywal@gmail.com
> > > > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > > > Subject: Re: HBase read performance
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I haven't seen it mentioned, but if I understand correctly
> each
> > > scan
> > > > > > > > returns a single row? If so you should use Scan#setSmall to
> save
> > > > > some rpc
> > > > > > > > calls.
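As a hedged illustration of that suggestion (the row key here is a placeholder, and this assumes small reversed scans are supported in the 0.98 build in use): marking the single-row reverse scan as small asks the server to satisfy it in one RPC rather than separate open/next/close calls, which is the saving referred to above.

    Scan scan = new Scan(Bytes.toBytes("some-row"));   // placeholder start row
    scan.setReversed(true);
    scan.setCaching(1);
    scan.setSmall(true);   // single round trip instead of openScanner/next/close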
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Nicolas
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <
> tianq01@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > when using separate HConnection instance, both its
> > > > > > > > > RpcClient instance(maintain connection to a regionserver)
> and
> > > > > Registry
> > > > > > > > > instance(maintain connection to zookeeper) will be
> separate..
> > > > > > > > >
> > > > > > > > > see
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
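A minimal sketch of that idea, i.e. spreading the client's reader threads over a handful of independent HConnections instead of the single implicitly shared one; the pool size and the round-robin choice are arbitrary assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;

    public class ConnectionPoolSketch {
      private final HConnection[] pool;

      public ConnectionPoolSketch(int size) throws java.io.IOException {
        Configuration conf = HBaseConfiguration.create();
        pool = new HConnection[size];
        for (int i = 0; i < size; i++) {
          // each HConnection carries its own RpcClient and its own ZooKeeper registry connection
          pool[i] = HConnectionManager.createConnection(conf);
        }
      }

      // callers pick a connection per thread/request and call getTable(...) on it
      public HConnection pick() {
        return pool[(int) (Thread.currentThread().getId() % pool.length)];
      }
    }

Note that every HConnection in such a pool also opens its own ZooKeeper session from the client host, which is relevant to the ZooKeeper errors discussed below.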
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > > > kdiaa@hotmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I tried creating my own HConnections pool to use for my HBase calls, so that not all the (2K) threads share the same HConnection. However, I could only have 10 HConnections. Beyond that I get ZK exceptions, please find it below. Also, with 10 HConnections, I don't see noticeable improvement in performance so far.
> > > > > > > > > >
> > > > > > > > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > 2014-10-05 06:11:26,490 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
> > > > > > > > > > 2014-10-05 06:11:27,845 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > 2014-10-05 06:11:27,849 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
> > > > > > > > > > 2014-10-05 06:11:30,405 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > 2014-10-05 06:11:30,405 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
> > > > > > > > > > 2014-10-05 06:11:35,278 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > 2014-10-05 06:11:35,279 INFO [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
> > > > > > > > > > 2014-10-05 06:11:44,393 WARN [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > > 2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
> > > > > > > > > > 2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> > > > > > > > > > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
> > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > > > > > >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > > >         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > > >         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > > >         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > > >         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
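The repeated ConnectionLoss on /hbase/hbaseid while building the 11th connection lines up with the link above about raising ZooKeeper's maximum number of client connections: each HConnection opens its own ZooKeeper session from the same client host, so a per-host cap on the ZooKeeper server (its maxClientCnxns setting) would explain why exactly ten succeed. That is an inference from the log, not something confirmed later in the thread.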
> > > > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > > > >
> > > > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > > > Thanks for all the input.
> > > > > > > > > > > Qiang: yes all the threads are in the same client
> process
> > > > > sharing
> > > > > > > the
> > > > > > > > > > same connection. And since I don't see hardware
> contention,
> > > may
> > > > > be
> > > > > > > there
> > > > > > > > > is
> > > > > > > > > > contention over this code path. I'll try using many
> > > connections
> > > > > and
> > > > > > > see
> > > > > > > > > if
> > > > > > > > > > it alleviates the problems and I'll report back.
> > > > > > > > > > > Thanks again,Khaled
> > > > > > > > > > >
> > > > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > > > >
> > > > > > >
> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > > > months
> > > > > > > > > > > > ago.
> > > > > > > > > > > >
> > > > > > > > > > > > processCallTime comes from RpcServer#call, so it
> looks
> > > good?
> > > > > > > > > > > >
> > > > > > > > > > > > I have a suspect:
> > > > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > > > >
> > > > > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > > > > if olny 1 process, those threads will share just 1
> > > > > connection to
> > > > > > > that
> > > > > > > > > > > > regionserver, there might be big contention on the
> RPC
> > > code
> > > > > path.
> > > > > > > > > > ---for
> > > > > > > > > > > > such case, could you try using different connections?
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> > > yuzhihong@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Khaled:
> > > > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Otherwise capturing stack trace during the period
> of
> > > > > reverse
> > > > > > > scan
> > > > > > > > > > would
> > > > > > > > > > > > > help.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > > > > larsh@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > You might have the data in the OS buffer cache,
> > > without
> > > > > short
> > > > > > > > > > circuit
> > > > > > > > > > > > > > reading the region server has to request the
> block
> > > from
> > > > > the
> > > > > > > data
> > > > > > > > > > node
> > > > > > > > > > > > > > process, which then reads it from the block
> cache.
> > > > > > > > > > > > > > That is a few context switches per RPC that do
> not
> > > show
> > > > > up
> > > > > > > in CPU
> > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If - as you say - you see a lot of evicted
> blocks the
> > > > > data
> > > > > > > *has*
> > > > > > > > > > to come
> > > > > > > > > > > > > > from the OS. If you do not see disk IO is *has*
> to
> > > come
> > > > > from
> > > > > > > the
> > > > > > > > > OS
> > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > I.e. there's more RAM on your boxes, and you
> should
> > > > > increase
> > > > > > > the
> > > > > > > > > > heap
> > > > > > > > > > > > > block
> > > > > > > > > > > > > > cache.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You can measure the context switches with vmstat.
> > > Other
> > > > > than
> > > > > > > that
> > > > > > > > > > I have
> > > > > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > > > > Also check the data locality index of the region
> > > server
> > > > > it
> > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > > > > close
> > > > > > > > > > > > > > to 100%.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> user@hbase.apache.org>
> > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Lars thanks a lot about all the tips. I'll make
> sure
> > > I
> > > > > cover
> > > > > > > all
> > > > > > > > > > of them
> > > > > > > > > > > > > > and get back to you. I am not sure they are the
> > > > > bottleneck
> > > > > > > though
> > > > > > > > > > as they
> > > > > > > > > > > > > > all are about optimizing physical resource
> usage. As
> > > I
> > > > > said,
> > > > > > > I
> > > > > > > > > > don't see
> > > > > > > > > > > > > > any contended physical resources now. I'll also
> try
> > > to
> > > > > > > reproduce
> > > > > > > > > > this
> > > > > > > > > > > > > > problem in a simpler environment and pass to you
> the
> > > test
> > > > > > > program
> > > > > > > > > > to play
> > > > > > > > > > > > > > with.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Couple of high level points to make. You are
> right
> > > that
> > > > > my
> > > > > > > use
> > > > > > > > > > case is
> > > > > > > > > > > > > > kind of a worst case for HBase reads. But, if
> things
> > > go
> > > > > the
> > > > > > > way
> > > > > > > > > you
> > > > > > > > > > > > > > described them, there should be tons of disk IO
> and
> > > that
> > > > > > > should
> > > > > > > > > be
> > > > > > > > > > > > > clearly
> > > > > > > > > > > > > > the bottleneck. This is not the case though.
> That's
> > > for
> > > > > the
> > > > > > > > > simple
> > > > > > > > > > reason
> > > > > > > > > > > > > > that this is done in a test environment (I am
> still
> > > > > > > prototyping),
> > > > > > > > > > and
> > > > > > > > > > > > > not a
> > > > > > > > > > > > > > lot of data is yet written to HBase. However for
> the
> > > > > real use
> > > > > > > > > > case, there
> > > > > > > > > > > > > > should writers constantly writing data to HBase
> and
> > > > > readers
> > > > > > > > > > occasionally
> > > > > > > > > > > > > > doing this scatter/gather. At steady state,
> things
> > > should
> > > > > > > only
> > > > > > > > > get
> > > > > > > > > > worse
> > > > > > > > > > > > > > and all the issues you mentioned should get far
> more
> > > > > > > pronounced.
> > > > > > > > > > At this
> > > > > > > > > > > > > > point, one can try to mitigate it using more
> memory
> > > or
> > > > > so. I
> > > > > > > am
> > > > > > > > > > not there
> > > > > > > > > > > > > > yet as I think I am hitting some software
> bottleneck,
> > > > > which I
> > > > > > > > > > don't know
> > > > > > > > > > > > > > how to work around.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > > > Any chance that you can provide a minimal test
> > > program
> > > > > and
> > > > > > > > > > instruction
> > > > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > One thing to note is that with scanning HBase
> > > cannot
> > > > > use
> > > > > > > bloom
> > > > > > > > > > filters
> > > > > > > > > > > > > > to rule out HFiles ahead of time, it needs to
> look
> > > into
> > > > > all
> > > > > > > of
> > > > > > > > > > them.
> > > > > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > > > > - random reads that do not fit into the block
> cache
> > > > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > > > > - Do you have short circuit reading enabled? If
> > > not,
> > > > > you
> > > > > > > > > should.
> > > > > > > > > > > > > > > - Is your table major compacted? That will
> reduce
> > > the
> > > > > > > number of
> > > > > > > > > > files
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > look at.
> > > > > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> > > > > tcpnodelay)?
> > > > > > > It
> > > > > > > > > > disabled
> > > > > > > > > > > > > > by default in HBase, but necessarily in your
> install
> > > of
> > > > > HDFS.
> > > > > > > > > > > > > > > - Which version of HDFS are you using as
> backing
> > > > > > > filesystem?
> > > > > > > > > > > > > > > - If your disk is idle, it means the data fits
> into
> > > > > the OS
> > > > > > > > > buffer
> > > > > > > > > > > > > cache.
> > > > > > > > > > > > > > In turn that means that you increase the heap
> for the
> > > > > region
> > > > > > > > > > servers. You
> > > > > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to
> > > make
> > > > > sure
> > > > > > > the
> > > > > > > > > > entire
> > > > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > - Also try to reduce the block size - although
> if
> > > your
> > > > > > > overall
> > > > > > > > > > working
> > > > > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is a good section of the book to read
> through
> > > > > > > generally
> > > > > > > > > > (even
> > > > > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > > > >
> > > http://hbase.apache.org/book.html#perf.configurations
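A hedged sketch of applying the FAST_DIFF encoding and the smaller block size mentioned above with the 0.98 admin API; the table name ("delta") comes from this thread, but the column family name ("f") and the 8k block size are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EncodingSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("delta"));
        HColumnDescriptor family = desc.getFamily(Bytes.toBytes("f"));
        family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // encoded blocks: more of the working set fits in cache
        family.setBlocksize(8 * 1024);                            // 8k blocks: less data read per random lookup
        admin.disableTable("delta");                              // done offline here to keep the sketch simple
        admin.modifyColumn("delta", family);
        admin.enableTable("delta");
        admin.close();
      }
    }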
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > > To: "user@hbase.apache.org" <
> user@hbase.apache.org
> > > >
> > > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I do see a very brief spike in CPU
> (user/system),
> > > but
> > > > > it's
> > > > > > > no
> > > > > > > > > > where
> > > > > > > > > > > > > near
> > > > > > > > > > > > > > 0% idle. It goes from 99% idle down to something
> > > like 40%
> > > > > > > idle
> > > > > > > > > for
> > > > > > > > > > a
> > > > > > > > > > > > > second
> > > > > > > > > > > > > > or so. The thing to note, this is all on a test
> > > cluster,
> > > > > so
> > > > > > > no
> > > > > > > > > > real load.
> > > > > > > > > > > > > > Things are generally idle until i issue 2-3 of
> these
> > > > > > > > > > multi-scan-requests
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > render a web page. Then, you see the spike in the
> > > cpu and
> > > > > > > some
> > > > > > > > > > activity
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > the network and disk, but nowhere near
> saturation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If there are specific tests you'd like me to
> do to
> > > > > debug
> > > > > > > this,
> > > > > > > > > > I'd be
> > > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> I still think you're waiting on disk. No
> IOWAIT?
> > > So
> > > > > CPU
> > > > > > > is not
> > > > > > > > > > waiting
> > > > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> If you see a lot of evicted block then each
> RPC
> > > has a
> > > > > high
> > > > > > > > > > chance of
> > > > > > > > > > > > > > requiring to bring an entire 64k block in. You'll
> > > see bad
> > > > > > > > > > performance
> > > > > > > > > > > > > with
> > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > > user@hbase.apache.org>
> > > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > > > > logging. No
> > > > > > > > > long
> > > > > > > > > > pauses
> > > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Other than the discrepancy between what's
> > > reported on
> > > > > the
> > > > > > > > > > client and
> > > > > > > > > > > > > > what's reported at the RS, there is also the
> issue
> > > of not
> > > > > > > getting
> > > > > > > > > > proper
> > > > > > > > > > > > > > concurrency. So, even if a reverse get takes
> 100ms
> > > or so
> > > > > > > (this
> > > > > > > > > has
> > > > > > > > > > to be
> > > > > > > > > > > > > > mostly blocking on various things as no physical
> > > > > resource is
> > > > > > > > > > contended),
> > > > > > > > > > > > > > then the other gets/scans should be able to
> proceed
> > > in
> > > > > > > parallel,
> > > > > > > > > > so a
> > > > > > > > > > > > > > thousand concurrent gets/scans should finish in
> few
> > > > > hundreds
> > > > > > > of
> > > > > > > > > ms
> > > > > > > > > > not
> > > > > > > > > > > > > many
> > > > > > > > > > > > > > seconds. That's why I thought I'd increase the
> > > handlers
> > > > > > > count to
> > > > > > > > > > try to
> > > > > > > > > > > > > get
> > > > > > > > > > > > > > more concurrency, but it didn't help. So, there
> must
> > > be
> > > > > > > something
> > > > > > > > > > else.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> Do check again on the heap size of the region
> > > > > servers.
> > > > > > > The
> > > > > > > > > > default
> > > > > > > > > > > > > > >>> unconfigured size is 1G; too small for much
> of
> > > > > anything.
> > > > > > > > > Check
> > > > > > > > > > your
> > > > > > > > > > > > > RS
> > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > >>> -- look for lines produced by the
> JVMPauseMonitor
> > > > > thread.
> > > > > > > > > They
> > > > > > > > > > > > > usually
> > > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > > process-freeze
> > > > > > > events.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> Get is implemented as a Scan of a single
> row, so
> > > a
> > > > > > > reverse
> > > > > > > > > > scan of a
> > > > > > > > > > > > > > single
> > > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> In practice, I have seen discrepancy between
> the
> > > > > > > latencies
> > > > > > > > > > reported
> > > > > > > > > > > > > by
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > >>> RS and the latencies experienced by the
> client.
> > > I've
> > > > > not
> > > > > > > > > > investigated
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > > Elmeleegy <
> > > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> Yes performance is similar with less
> handlers (I
> > > > > tried
> > > > > > > with
> > > > > > > > > > 100
> > > > > > > > > > > > > > first).
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The
> working
> > > set
> > > > > > > doesn't
> > > > > > > > > > seem to
> > > > > > > > > > > > > > fit in
> > > > > > > > > > > > > > >>>> memory as there are many cache misses.
> However,
> > > > > disk is
> > > > > > > far
> > > > > > > > > > from
> > > > > > > > > > > > > > being a
> > > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also
> > > verified
> > > > > that
> > > > > > > > > > neither the
> > > > > > > > > > > > > > >>>> network nor the CPU of the region server or
> the
> > > > > client
> > > > > > > are a
> > > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > > >>>> This leads me to believe that likely this
> is a
> > > > > software
> > > > > > > > > > bottleneck,
> > > > > > > > > > > > > > >>>> possibly due to a misconfiguration on my
> side. I
> > > > > just
> > > > > > > don't
> > > > > > > > > > know how
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is the
> > > individual
> > > > > > > request
> > > > > > > > > > latency
> > > > > > > > > > > > > > as
> > > > > > > > > > > > > > >>>> reported by metrics on the region server
> (IPC
> > > > > > > > > processCallTime
> > > > > > > > > > vs
> > > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > > >>>> vs what's measured on the client. Does this
> > > sound
> > > > > > > right? Any
> > > > > > > > > > ideas
> > > > > > > > > > > > > on
> > > > > > > > > > > > > > how
> > > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> About this trick with the timestamps to be
> able
> > > to
> > > > > do a
> > > > > > > > > > forward
> > > > > > > > > > > > > scan,
> > > > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am
> > > aware of
> > > > > it.
> > > > > > > The
> > > > > > > > > > problem
> > > > > > > > > > > > > I
> > > > > > > > > > > > > > have
> > > > > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > > > > particular
> > > > > > > > > > timestamp and
> > > > > > > > > > > > > > >>>> sometimes I want to get the key before, so
> just
> > > > > relying
> > > > > > > on
> > > > > > > > > > the key
> > > > > > > > > > > > > > order
> > > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse
> get(). I
> > > > > thought
> > > > > > > > > > reverse
> > > > > > > > > > > > > scan
> > > > > > > > > > > > > > can
> > > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500
> handler
> > > > > threads
> > > > > > > > > > seems to
> > > > > > > > > > > > > be a
> > > > > > > > > > > > > > >>>> lot. Typically a good number of threads
> depends
> > > on
> > > > > the
> > > > > > > > > > hardware
> > > > > > > > > > > > > > (number of
> > > > > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot
> think
> > > of
> > > > > any
> > > > > > > type
> > > > > > > > > of
> > > > > > > > > > > > > > scenario
> > > > > > > > > > > > > > >>>> where more than 100 would give any
> improvement.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> How large is the payload per KV retrieved
> that
> > > > > way? If
> > > > > > > > > large
> > > > > > > > > > (as
> > > > > > > > > > > > > in a
> > > > > > > > > > > > > > >>>> few 100k) you definitely want to lower the
> > > number
> > > > > of the
> > > > > > > > > > handler
> > > > > > > > > > > > > > threads.
> > > > > > > > > > > > > > >>>>> How much heap do you give the region
> server?
> > > Does
> > > > > the
> > > > > > > > > > working set
> > > > > > > > > > > > > fit
> > > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you
> > > see the
> > > > > > > > > eviction
> > > > > > > > > > count
> > > > > > > > > > > > > > going
> > > > > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> If the working set does not fit into the
> cache
> > > > > > > (eviction
> > > > > > > > > > count goes
> > > > > > > > > > > > > > up)
> > > > > > > > > > > > > > >>>> then HBase will need to bring a new block in
> > > from
> > > > > disk
> > > > > > > on
> > > > > > > > > > each Get
> > > > > > > > > > > > > > >>>> (assuming the Gets are more or less random
> as
> > > far
> > > > > as the
> > > > > > > > > > server is
> > > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > > >>>>> In case you'll benefit from reducing the
> HFile
> > > > > block
> > > > > > > size
> > > > > > > > > > (from 64k
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Lastly I don't think we tested the
> performance
> > > of
> > > > > using
> > > > > > > > > > reverse
> > > > > > > > > > > > > scan
> > > > > > > > > > > > > > >>>> this way, there is probably room to optimize
> > > this.
> > > > > > > > > > > > > > >>>>> Can you restructure your keys to allow
> forwards
> > > > > > > scanning?
> > > > > > > > > For
> > > > > > > > > > > > > example
> > > > > > > > > > > > > > >>>> you could store the time as MAX_LONG-time.
> Or
> > > you
> > > > > could
> > > > > > > > > > invert all
> > > > > > > > > > > > > > the bits
> > > > > > > > > > > > > > >>>> of the time portion of the key, so that it
> sort
> > > the
> > > > > > > other
> > > > > > > > > > way. Then
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> -- Lars
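A small sketch of the key layout suggested just above (storing the time component as MAX_LONG minus the event time so that newer entries sort first and a plain forward scan can find "latest at or before t"); the entity-id prefix and helper names are assumptions:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReversedTimeKeySketch {
      // Row key = entityId + (Long.MAX_VALUE - eventTime). The inverted timestamp is
      // non-negative, so its big-endian bytes sort newest-first within an entity.
      static byte[] rowKey(byte[] entityId, long eventTime) {
        return Bytes.add(entityId, Bytes.toBytes(Long.MAX_VALUE - eventTime));
      }

      // "Latest entry at or before t" becomes an ordinary forward scan that starts at
      // rowKey(entityId, t) and takes the first row (bound it to the entity's prefix in real code).
      static Scan latestAtOrBefore(byte[] entityId, long t) {
        Scan scan = new Scan(rowKey(entityId, t));
        scan.setCaching(1);
        scan.setSmall(true);   // one-RPC single-row lookup, per the setSmall suggestion elsewhere in the thread
        return scan;
      }
    }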
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <
> > > user@hbase.apache.org
> > > > > >
> > > > > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> > > > > (0.98.6.1),
> > > > > > > > > > where I
> > > > > > > > > > > > > have
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > >>>> client reading ~1000 keys from an HBase
> table.
> > > These
> > > > > > > keys
> > > > > > > > > > happen to
> > > > > > > > > > > > > > fall on
> > > > > > > > > > > > > > >>>> the same region server. For my reads I use
> > > reverse
> > > > > scan
> > > > > > > to
> > > > > > > > > > read each
> > > > > > > > > > > > > > key as
> > > > > > > > > > > > > > >>>> I want the key prior to a specific time
> stamp
> > > (time
> > > > > > > stamps
> > > > > > > > > are
> > > > > > > > > > > > > stored
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > >>>> reverse order). I don't believe gets can
> > > accomplish
> > > > > > > that,
> > > > > > > > > > right? so
> > > > > > > > > > > > > I
> > > > > > > > > > > > > > use
> > > > > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> I use 2000 reader threads in the client
> and on
> > > > > HBase,
> > > > > > > I've
> > > > > > > > > > set
> > > > > > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500.
> With
> > > this
> > > > > > > setup,
> > > > > > > > > my
> > > > > > > > > > > > > scatter
> > > > > > > > > > > > > > >>>> gather is very slow and can take up to 10s
> in
> > > total.
> > > > > > > Timing
> > > > > > > > > an
> > > > > > > > > > > > > > individual
> > > > > > > > > > > > > > >>>> getScanner(..) call on the client side, it
> can
> > > > > easily
> > > > > > > take
> > > > > > > > > few
> > > > > > > > > > > > > > hundreds of
> > > > > > > > > > > > > > >>>> ms. I also got the following metrics from
> the
> > > region
> > > > > > > server
> > > > > > > > > in
> > > > > > > > > > > > > > question:
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> > > > > 556.9799999999818,
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > > > > >>>>> "processCallTime_mean" :
> 105.64873440912682,
> > > > > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > > > > >>>>> "processCallTime_95th_percentile" :
> 7917.95,
> > > > > > > > > > > > > > >>>>> "processCallTime_99th_percentile" :
> 8876.89,
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> > > > > > > > > > > > > > >>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> > > > > querying.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> In addition to all this, i monitored the
> > > hardware
> > > > > > > resources
> > > > > > > > > > (CPU,
> > > > > > > > > > > > > > disk,
> > > > > > > > > > > > > > >>>> and network) of both the client and the
> region
> > > > > server
> > > > > > > and
> > > > > > > > > > nothing
> > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by
> > > what's
> > > > > > > going on
> > > > > > > > > > and
> > > > > > > > > > > > > > where this
> > > > > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Few things to note based on the above
> > > measurements:
> > > > > > > both
> > > > > > > > > > medians of
> > > > > > > > > > > > > > IPC
> > > > > > > > > > > > > > >>>> processCallTime and queueCallTime are
> basically
> > > zero
> > > > > > > (ms I
> > > > > > > > > > presume,
> > > > > > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms
> > > too,
> > > > > > > right?). I
> > > > > > > > > > am not
> > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > >>>> how this adds up. Also, even though the 101
> > > figure
> > > > > seems
> > > > > > > > > > > > > outrageously
> > > > > > > > > > > > > > high
> > > > > > > > > > > > > > >>>> and I don't know why, still all these scans
> > > should
> > > > > be
> > > > > > > > > > happening in
> > > > > > > > > > > > > > >>>> parallel, so the overall call should finish
> > > fast,
> > > > > given
> > > > > > > that
> > > > > > > > > > no
> > > > > > > > > > > > > > hardware
> > > > > > > > > > > > > > >>>> resource is contended, right? but this is
> not
> > > what's
> > > > > > > > > > happening, so I
> > > > > > > > > > > > > > have
> > > > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > >    - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> > > > >
> > > > >
> > >
> > >
>
>

Re: HBase read performance

Posted by Ted Yu <yu...@gmail.com>.
Thanks for the confirmation. 

Looks like the combination of bucket cache and cache on write has a bug. 

On Oct 12, 2014, at 8:58 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:

> It goes away, but i think that's because i am basically churning the cache far less. My workload is mostly writes.
> 
>> Date: Sun, 12 Oct 2014 19:43:16 -0700
>> Subject: Re: HBase read performance
>> From: yuzhihong@gmail.com
>> To: user@hbase.apache.org
>> 
>> Hi,
>> Can you turn off cacheonwrite and keep bucket cache to see if the problem
>> goes away ?
>> 
>> Cheers
>> 
>> On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
>> wrote:
>> 
>>> Yes, I can reproduce it with some work.
>>> The workload is basically as follows:
>>> There are writers streaming writes to a table. Then, there is a reader
>>> (invoked via a web interface). The reader does a 1000 parallel reverse
>>> scans, all end up hitting the same region in my case. The scans are
>>> effectively "gets" as I just need to get one record off of each of them. I
>>> just need to do a "reverse" get, which is not supported (would be great to
>>> have :)), so I do it via reverse scan. After few tries, the reader
>>> consistently hits this bug.
>>> 
>>> This happens with these config changes:
>>> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
>>> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
>>> hbase-site: hbase.bucketcache.ioengine=offheap
>>> hbase.bucketcache.size=4196
>>> hbase.rs.cacheblocksonwrite=true
>>> hfile.block.index.cacheonwrite=true
>>> hfile.block.bloom.cacheonwrite=true
>>> 
>>> Interestingly, without these config changes, I can't reproduce the problem.
>>> Khaled
>>> 
>>> 
>>>> Date: Fri, 10 Oct 2014 10:05:14 -0700
>>>> Subject: Re: HBase read performance
>>>> From: stack@duboce.net
>>>> To: user@hbase.apache.org
>>>> 
>>>> It looks like we are messing up our positioning math:
>>>> 
>>>> java.nio.Buffer, lines 233-235 (OpenJDK 6-b14, via
>>>> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233):
>>>> 
>>>>     public final Buffer position(int newPosition) {
>>>>         if ((newPosition > limit) || (newPosition < 0))
>>>>             throw new IllegalArgumentException();
>>>> 
>>>> 
>>>> Is it easy to reproduce Khaled?  Always same region/store or spread
>>>> across all reads?
>>>> 
>>>> 
>>>> St.Ack
>>>> 
>>>> 
>>>> 
>>>> On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kd...@hotmail.com>
>>> wrote:
>>>> 
>>>>> Thanks, Andrew. I think this was indeed the problem.
>>>>> To get over it, I increased the amount of memory given to the region
>>>>> server to avoid IO on reads, using the configs below. In my experiments,
>>>>> I have writers streaming their output to HBase. The reader powers a web
>>>>> page and does this scatter/gather, where it reads the 1000 most recently
>>>>> written keys and passes them to the front end. With this workload, I get
>>>>> the exception below at the region server. Again, I am using HBase
>>>>> (0.98.6.1). Any help is appreciated.
>>>>> hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
>>>>>   -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
>>>>>   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>>>>>   -Xloggc:/tmp/hbase-regionserver-gc.log
>>>>> hbase-site: hbase.bucketcache.ioengine=offheap
>>>>>             hfile.block.cache.size=0.4
>>>>>             hbase.bucketcache.size=4196
>>>>>             hbase.rs.cacheblocksonwrite=true
>>>>>             hfile.block.index.cacheonwrite=true
>>>>>             hfile.block.bloom.cacheonwrite=true
>>>>> 
>>>>> 2014-10-10 15:06:44,173 ERROR
>>>>> [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer:
>>>>> Unexpected throwable object
>>>>> java.lang.IllegalArgumentException
>>>>>        at java.nio.Buffer.position(Buffer.java:236)
>>>>>        at
>>> org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
>>>>>        at
>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
>>>>>        at
>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
>>>>>        at
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
>>>>>        at
>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
>>>>>        at
>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>>>>>        at
>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>>>>>        at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>>>>>        at
>>>>> org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>>>>>        at java.lang.Thread.run(Thread.java:744)
>>>>>> From: apurtell@apache.org
>>>>>> Date: Tue, 7 Oct 2014 17:09:35 -0700
>>>>>> Subject: Re: HBase read performance
>>>>>> To: user@hbase.apache.org
>>>>>> 
>>>>>>> The cluster has 2 m1.large nodes.
>>>>>> 
>>>>>> That's the problem right there.
>>>>>> 
>>>>>> You need to look at c3.4xlarge or i2 instances as a minimum
>>> requirement.
>>>>> M1
>>>>>> and even M3 instance types have ridiculously poor IO.
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Thanks Nicolas, Qiang.
>>>>>>> 
>>>>>>> I was able to write a simple program that reproduces the problem
>>> on a
>>>>> tiny
>>>>>>> HBase cluster on ec2. The cluster has 2 m1.large nodes. One node
>>> runs
>>>>> the
>>>>>>> master, name node and zookeeper. The other node runs a data node
>>> and a
>>>>>>> region server, with heap size configured to be 6GB. There, the 1000
>>>>>>> parallel reverse gets (reverse scans) take 7-8 seconds. The data
>>> set is
>>>>>>> tiny (10M records, each having a small number of bytes). As I said
>>>>> before,
>>>>>>> all hardware resources are very idle there.
>>>>>>> 
>>>>>>> Interestingly, running the same workload on my macbook, the 1000
>>>>> parallel
>>>>>>> gets take ~200ms on a pseudo-distributed installation.
>>>>>>> 
>>>>>>> Any help to resolve this mystery is highly appreciated.
>>>>>>> 
>>>>>>> P.S. please find my test program attached.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Khaled
>>>>>>> 
>>>>>>>> From: nkeywal@gmail.com
>>>>>>>> Date: Mon, 6 Oct 2014 09:40:48 +0200
>>>>>>>> Subject: Re: HBase read performance
>>>>>>>> To: user@hbase.apache.org
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I haven't seen it mentioned, but if I understand correctly each
>>> scan
>>>>>>>> returns a single row? If so you should use Scan#setSmall to save
>>>>> some rpc
>>>>>>>> calls.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Nicolas
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> When using a separate HConnection instance, both its RpcClient
>>>>>>>>> instance (which maintains the connection to a region server) and its
>>>>>>>>> Registry instance (which maintains the connection to ZooKeeper) will
>>>>>>>>> be separate.
>>>>>>>>>
>>>>>>>>> See
>>>>>>>>> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
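
The ZooKeeper ConnectionLoss errors Khaled reports below when opening more than a handful of HConnections are what you would expect when the per-client connection cap on the ZooKeeper side is hit; as a sketch, that cap is the maxClientCnxns setting in zoo.cfg (the value here is just an example):

    # zoo.cfg: max concurrent connections allowed from a single client IP
    maxClientCnxns=300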
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
>>>>> kdiaa@hotmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I tried creating my own HConnections pool to use for my HBase
>>>>> calls,
>>>>>>> so
>>>>>>>>>> that not all the (2K) threads share the same HConnection.
>>>>> However, I
>>>>>>>>> could
>>>>>>>>>> only have 10 HConnections. Beyond that I get ZK exceptions,
>>>>> please
>>>>>>> find
>>>>>>>>> it
>>>>>>>>>> below. Also, with 10 HConnections, I don't see noticeable
>>>>>>> improvement in
>>>>>>>>>> performance so far.
>>>>>>>>>> 2014-10-05 06:11:26,490 WARN [main]
>>>>> zookeeper.RecoverableZooKeeper
>>>>>>>>>> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
>>>>> transient
>>>>>>>>>> ZooKeeper, quorum=54.68.206.252:2181,
>>> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
>>>>>>>>> 06:11:26,490
>>>>>>>>>> INFO [main] util.RetryCounter
>>>>>>>>> (RetryCounter.java:sleepUntilNextRetry(155))
>>>>>>>>>> - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845
>>> WARN
>>>>>>> [main]
>>>>>>>>>> zookeeper.RecoverableZooKeeper
>>>>>>>>>> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
>>>>> transient
>>>>>>>>>> ZooKeeper, quorum=54.68.206.252:2181,
>>> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
>>>>>>>>> 06:11:27,849
>>>>>>>>>> INFO [main] util.RetryCounter
>>>>>>>>> (RetryCounter.java:sleepUntilNextRetry(155))
>>>>>>>>>> - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405
>>> WARN
>>>>>>> [main]
>>>>>>>>>> zookeeper.RecoverableZooKeeper
>>>>>>>>>> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
>>>>> transient
>>>>>>>>>> ZooKeeper, quorum=54.68.206.252:2181,
>>> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
>>>>>>>>> 06:11:30,405
>>>>>>>>>> INFO [main] util.RetryCounter
>>>>>>>>> (RetryCounter.java:sleepUntilNextRetry(155))
>>>>>>>>>> - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278
>>> WARN
>>>>>>> [main]
>>>>>>>>>> zookeeper.RecoverableZooKeeper
>>>>>>>>>> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
>>>>> transient
>>>>>>>>>> ZooKeeper, quorum=54.68.206.252:2181,
>>> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
>>>>>>>>> 06:11:35,279
>>>>>>>>>> INFO [main] util.RetryCounter
>>>>>>>>> (RetryCounter.java:sleepUntilNextRetry(155))
>>>>>>>>>> - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393
>>> WARN
>>>>>>> [main]
>>>>>>>>>> zookeeper.RecoverableZooKeeper
>>>>>>>>>> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
>>>>> transient
>>>>>>>>>> ZooKeeper, quorum=54.68.206.252:2181,
>>> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
>>>>>>>>> 06:11:44,393
>>>>>>>>>> ERROR [main] zookeeper.RecoverableZooKeeper
>>>>>>>>>> (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper
>>> exists
>>>>>>> failed
>>>>>>>>>> after 4 attempts2014-10-05 06:11:44,394 WARN [main]
>>>>> zookeeper.ZKUtil
>>>>>>>>>> (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b,
>>> quorum=
>>>>>>>>>> 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher
>>> on
>>>>> znode
>>> (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>>>>>> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>>>>>> at
>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>> at
>>>>>>>>>> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
>>>>>>>>>> at
>>> org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
>>>>>>>>>> at
>>> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>>>>>>>>>> at
>>> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
>>>>>>>>>> at
>>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
>>>>>>>>>> at
>>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
>>>>>>>>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>>>>>>>> Method) at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>>>> at
>>>>> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>>> at
>>> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
>>>>>>>>>> at
>>> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
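
A rough sketch of the HConnection pool being described (pool size, names, and error handling are made up; HConnectionManager.createConnection is the 0.98-era API that appears in the stack trace above):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;

    // Round-robins reader threads across a small pool of HConnections so that
    // they do not all funnel through a single RPC connection to the region server.
    public class ConnectionPool {
      private final List<HConnection> connections = new ArrayList<HConnection>();
      private final AtomicInteger next = new AtomicInteger();

      public ConnectionPool(Configuration conf, int size) throws Exception {
        for (int i = 0; i < size; i++) {
          connections.add(HConnectionManager.createConnection(conf));
        }
      }

      public HTableInterface getTable(String name) throws Exception {
        int i = (next.getAndIncrement() & 0x7fffffff) % connections.size();
        return connections.get(i).getTable(name);
      }

      public void close() throws Exception {
        for (HConnection c : connections) {
          c.close();
        }
      }
    }

Each reader thread would fetch a table from the pool, run its reverse scan, and close the table; note that, as reported above, going past roughly 10 connections also requires raising the ZooKeeper connection limit.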
>>>>>>>>>>> From: kdiaa@hotmail.com
>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>> Date: Fri, 3 Oct 2014 12:37:39 -0700
>>>>>>>>>>> 
>>>>>>>>>>> Lars, Ted, and Qiang,
>>>>>>>>>>> Thanks for all the input.
>>>>>>>>>>> Qiang: yes all the threads are in the same client process
>>>>> sharing
>>>>>>> the
>>>>>>>>>> same connection. And since I don't see hardware contention,
>>> may
>>>>> be
>>>>>>> there
>>>>>>>>> is
>>>>>>>>>> contention over this code path. I'll try using many
>>> connections
>>>>> and
>>>>>>> see
>>>>>>>>> if
>>>>>>>>>> it alleviates the problems and I'll report back.
>>>>>>>>>>> Thanks again,Khaled
>>>>>>>>>>> 
>>>>>>>>>>>> Date: Fri, 3 Oct 2014 15:18:30 +0800
>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>> From: tianq01@gmail.com
>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>> 
>>>>>>>>>>>> Regarding profiling, Andrew introduced
>>>>>>>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>>>> months ago.
>>>>>>>>>>>>
>>>>>>>>>>>> processCallTime comes from RpcServer#call, so it looks good?
>>>>>>>>>>>>
>>>>>>>>>>>> I have a suspect:
>>>>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-11306
>>>>>>>>>>>>
>>>>>>>>>>>> How many processes do you have for your 2000 threads?
>>>>>>>>>>>> If only 1 process, those threads will share just 1 connection to
>>>>>>>>>>>> that regionserver, and there might be big contention on the RPC
>>>>>>>>>>>> code path. For such a case, could you try using different
>>>>>>>>>>>> connections?
>>>>>>>>>>>> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
>>> yuzhihong@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Khaled:
>>>>>>>>>>>>> Do you have profiler such as jprofiler ?
>>>>>>>>>>>>> Profiling would give us more hint.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Otherwise capturing stack trace during the period of
>>>>> reverse
>>>>>>> scan
>>>>>>>>>> would
>>>>>>>>>>>>> help.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
>>>>>>> larsh@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You might have the data in the OS buffer cache,
>>> without
>>>>> short
>>>>>>>>>> circuit
>>>>>>>>>>>>>> reading the region server has to request the block
>>> from
>>>>> the
>>>>>>> data
>>>>>>>>>> node
>>>>>>>>>>>>>> process, which then reads it from the block cache.
>>>>>>>>>>>>>> That is a few context switches per RPC that do not
>>> show
>>>>> up
>>>>>>> in CPU
>>>>>>>>>>>>> metrics.
>>>>>>>>>>>>>> In that you also would not see disk IO.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If - as you say - you see a lot of evicted blocks the
>>>>> data
>>>>>>> *has*
>>>>>>>>>> to come
>>>>>>>>>>>>>> from the OS. If you do not see disk IO is *has* to
>>> come
>>>>> from
>>>>>>> the
>>>>>>>>> OS
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>> I.e. there's more RAM on your boxes, and you should
>>>>> increase
>>>>>>> the
>>>>>>>>>> heap
>>>>>>>>>>>>> block
>>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You can measure the context switches with vmstat.
>>> Other
>>>>> than
>>>>>>> that
>>>>>>>>>> I have
>>>>>>>>>>>>>> no suggestion until I reproduce the problem.
>>>>>>>>>>>>>> Also check the data locality index of the region
>>> server
>>>>> it
>>>>>>> should
>>>>>>>>>> be
>>>>>>>>>>>>> close
>>>>>>>>>>>>>> to 100%.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- Lars
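
For reference, a sketch of the short-circuit read settings Lars refers to (these are the standard HDFS 2.x property names; the socket path is only an example) -- they go into hdfs-site.xml on the datanodes and region servers, letting the region server read local blocks directly instead of going through the datanode process:

    dfs.client.read.shortcircuit=true
    dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket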
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>>>>>>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 3:24 PM
>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Lars thanks a lot about all the tips. I'll make sure
>>> I
>>>>> cover
>>>>>>> all
>>>>>>>>>> of them
>>>>>>>>>>>>>> and get back to you. I am not sure they are the
>>>>> bottleneck
>>>>>>> though
>>>>>>>>>> as they
>>>>>>>>>>>>>> all are about optimizing physical resource usage. As
>>> I
>>>>> said,
>>>>>>> I
>>>>>>>>>> don't see
>>>>>>>>>>>>>> any contended physical resources now. I'll also try
>>> to
>>>>>>> reproduce
>>>>>>>>>> this
>>>>>>>>>>>>>> problem in a simpler environment and pass to you the
>>> test
>>>>>>> program
>>>>>>>>>> to play
>>>>>>>>>>>>>> with.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Couple of high level points to make. You are right
>>> that
>>>>> my
>>>>>>> use
>>>>>>>>>> case is
>>>>>>>>>>>>>> kind of a worst case for HBase reads. But, if things
>>> go
>>>>> the
>>>>>>> way
>>>>>>>>> you
>>>>>>>>>>>>>> described them, there should be tons of disk IO and
>>> that
>>>>>>> should
>>>>>>>>> be
>>>>>>>>>>>>> clearly
>>>>>>>>>>>>>> the bottleneck. This is not the case though. That's
>>> for
>>>>> the
>>>>>>>>> simple
>>>>>>>>>> reason
>>>>>>>>>>>>>> that this is done in a test environment (I am still
>>>>>>> prototyping),
>>>>>>>>>> and
>>>>>>>>>>>>> not a
>>>>>>>>>>>>>> lot of data is yet written to HBase. However for the
>>>>> real use
>>>>>>>>>> case, there
>>>>>>>>>>>>> should be writers constantly writing data to HBase and
>>>>> readers
>>>>>>>>>> occasionally
>>>>>>>>>>>>>> doing this scatter/gather. At steady state, things
>>> should
>>>>>>> only
>>>>>>>>> get
>>>>>>>>>> worse
>>>>>>>>>>>>>> and all the issues you mentioned should get far more
>>>>>>> pronounced.
>>>>>>>>>> At this
>>>>>>>>>>>>>> point, one can try to mitigate it using more memory
>>> or
>>>>> so. I
>>>>>>> am
>>>>>>>>>> not there
>>>>>>>>>>>>>> yet as I think I am hitting some software bottleneck,
>>>>> which I
>>>>>>>>>> don't know
>>>>>>>>>>>>>> how to work around.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 14:20:47 -0700
>>>>>>>>>>>>>>> From: larsh@apache.org
>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> OK... We might need to investigate this.
>>>>>>>>>>>>>>> Any chance that you can provide a minimal test
>>> program
>>>>> and
>>>>>>>>>> instruction
>>>>>>>>>>>>>> about how to set it up.
>>>>>>>>>>>>>>> We can do some profiling then.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> One thing to note is that with scanning HBase
>>> cannot
>>>>> use
>>>>>>> bloom
>>>>>>>>>> filters
>>>>>>>>>>>>>> to rule out HFiles ahead of time, it needs to look
>>> into
>>>>> all
>>>>>>> of
>>>>>>>>>> them.
>>>>>>>>>>>>>>> So you kind of hit on the absolute worst case:
>>>>>>>>>>>>>>> - random reads that do not fit into the block cache
>>>>>>>>>>>>>>> - cannot use bloom filters
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Few more question/comments:
>>>>>>>>>>>>>>> - Do you have short circuit reading enabled? If
>>> not,
>>>>> you
>>>>>>>>> should.
>>>>>>>>>>>>>>> - Is your table major compacted? That will reduce
>>> the
>>>>>>> number of
>>>>>>>>>> files
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> look at.
>>>>>>>>>>>>>>> - Did you disable Nagle's everywhere (enabled tcpnodelay)? It is
>>>>>>>>>>>>>>> disabled by default in HBase, but not necessarily in your install
>>>>>>>>>>>>>>> of HDFS.
>>>>>>>>>>>>>>> - Which version of HDFS are you using as backing
>>>>>>> filesystem?
>>>>>>>>>>>>>>> - If your disk is idle, it means the data fits into
>>>>> the OS
>>>>>>>>> buffer
>>>>>>>>>>>>> cache.
>>>>>>>>>>>>>> In turn that means you should increase the heap for the region
>>>>>>>>>>>>>> servers. You
>>>>>>>>>>>>>> can also use block encoding (FAST_DIFF) to try to
>>> make
>>>>> sure
>>>>>>> the
>>>>>>>>>> entire
>>>>>>>>>>>>>> working set fits into the cache.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> - Also try to reduce the block size - although if
>>> your
>>>>>>> overall
>>>>>>>>>> working
>>>>>>>>>>>>>> set does not fit in the heap it won't help much.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This is a good section of the book to read through
>>>>>>> generally
>>>>>>>>>> (even
>>>>>>>>>>>>>> though you might know most of this already):
>>> http://hbase.apache.org/book.html#perf.configurations
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -- Lars
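
As a sketch of the block size and encoding suggestions above (the family name and sizes are only illustrative), the column family could be declared like this when the table is created:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class CreateDelta {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor cf = new HColumnDescriptor("d");     // illustrative family name
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);  // shrink the cached working set
        cf.setBlocksize(8 * 1024);                             // 8k blocks instead of the 64k default

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("delta"));
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }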
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>>>>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org
>>>> 
>>>>>>>>>>>>>>> Cc:
>>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 11:27 AM
>>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I do see a very brief spike in CPU (user/system),
>>> but
>>>>> it's
>>>>>>> no
>>>>>>>>>> where
>>>>>>>>>>>>> near
>>>>>>>>>>>>>> 0% idle. It goes from 99% idle down to something
>>> like 40%
>>>>>>> idle
>>>>>>>>> for
>>>>>>>>>> a
>>>>>>>>>>>>> second
>>>>>>>>>>>>>> or so. The thing to note, this is all on a test
>>> cluster,
>>>>> so
>>>>>>> no
>>>>>>>>>> real load.
>>>>>>>>>>>>>> Things are generally idle until i issue 2-3 of these
>>>>>>>>>> multi-scan-requests
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> render a web page. Then, you see the spike in the
>>> cpu and
>>>>>>> some
>>>>>>>>>> activity
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the network and disk, but nowhere near saturation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If there are specific tests you'd like me to do to
>>>>> debug
>>>>>>> this,
>>>>>>>>>> I'd be
>>>>>>>>>>>>>> more than happy to do it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 11:15:59 -0700
>>>>>>>>>>>>>>>> From: larsh@apache.org
>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I still think you're waiting on disk. No IOWAIT?
>>> So
>>>>> CPU
>>>>>>> is not
>>>>>>>>>> waiting
>>>>>>>>>>>>>> a lot for IO. No high User/System CPU either?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If you see a lot of evicted block then each RPC
>>> has a
>>>>> high
>>>>>>>>>> chance of
>>>>>>>>>>>>>> requiring to bring an entire 64k block in. You'll
>>> see bad
>>>>>>>>>> performance
>>>>>>>>>>>>> with
>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We might need to trace this specific scenario.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- Lars
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>>>>>>>>>>>>> To: "user@hbase.apache.org" <
>>> user@hbase.apache.org>
>>>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 10:46 AM
>>>>>>>>>>>>>>>> Subject: RE: HBase read performance
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I've set the heap size to 6GB and I do have gc
>>>>> logging. No
>>>>>>>>> long
>>>>>>>>>> pauses
>>>>>>>>>>>>>> there -- occasional 0.1s or 0.2s.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Other than the discrepancy between what's
>>> reported on
>>>>> the
>>>>>>>>>> client and
>>>>>>>>>>>>>> what's reported at the RS, there is also the issue
>>> of not
>>>>>>> getting
>>>>>>>>>> proper
>>>>>>>>>>>>>> concurrency. So, even if a reverse get takes 100ms
>>> or so
>>>>>>> (this
>>>>>>>>> has
>>>>>>>>>> to be
>>>>>>>>>>>>>> mostly blocking on various things as no physical
>>>>> resource is
>>>>>>>>>> contended),
>>>>>>>>>>>>>> then the other gets/scans should be able to proceed
>>> in
>>>>>>> parallel,
>>>>>>>>>> so a
>>>>>>>>>>>>>> thousand concurrent gets/scans should finish in few
>>>>> hundreds
>>>>>>> of
>>>>>>>>> ms
>>>>>>>>>> not
>>>>>>>>>>>>> many
>>>>>>>>>>>>>> seconds. That's why I thought I'd increase the
>>> handlers
>>>>>>> count to
>>>>>>>>>> try to
>>>>>>>>>>>>> get
>>>>>>>>>>>>>> more concurrency, but it didn't help. So, there must
>>> be
>>>>>>> something
>>>>>>>>>> else.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>>>> From: ndimiduk@gmail.com
>>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Do check again on the heap size of the region
>>>>> servers.
>>>>>>> The
>>>>>>>>>> default
>>>>>>>>>>>>>>>>> unconfigured size is 1G; too small for much of
>>>>> anything.
>>>>>>>>> Check
>>>>>>>>>> your
>>>>>>>>>>>>> RS
>>>>>>>>>>>>>> logs
>>>>>>>>>>>>>>>>> -- look for lines produced by the JVMPauseMonitor
>>>>> thread.
>>>>>>>>> They
>>>>>>>>>>>>> usually
>>>>>>>>>>>>>>>>> correlate with long GC pauses or other
>>> process-freeze
>>>>>>> events.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Get is implemented as a Scan of a single row, so
>>> a
>>>>>>> reverse
>>>>>>>>>> scan of a
>>>>>>>>>>>>>> single
>>>>>>>>>>>>>>>>> row should be functionally equivalent.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In practice, I have seen discrepancy between the
>>>>>>> latencies
>>>>>>>>>> reported
>>>>>>>>>>>>> by
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> RS and the latencies experienced by the client.
>>> I've
>>>>> not
>>>>>>>>>> investigated
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> area thoroughly.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
>>> Elmeleegy <
>>>>>>>>>> kdiaa@hotmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks Lars for your quick reply.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Yes performance is similar with less handlers (I
>>>>> tried
>>>>>>> with
>>>>>>>>>> 100
>>>>>>>>>>>>>> first).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The payload is not big ~1KB or so. The working
>>> set
>>>>>>> doesn't
>>>>>>>>>> seem to
>>>>>>>>>>>>>> fit in
>>>>>>>>>>>>>>>>>> memory as there are many cache misses. However,
>>>>> disk is
>>>>>>> far
>>>>>>>>>> from
>>>>>>>>>>>>>> being a
>>>>>>>>>>>>>>>>>> bottleneck. I checked using iostat. I also
>>> verified
>>>>> that
>>>>>>>>>> neither the
>>>>>>>>>>>>>>>>>> network nor the CPU of the region server or the
>>>>> client
>>>>>>> are a
>>>>>>>>>>>>>> bottleneck.
>>>>>>>>>>>>>>>>>> This leads me to believe that likely this is a
>>>>> software
>>>>>>>>>> bottleneck,
>>>>>>>>>>>>>>>>>> possibly due to a misconfiguration on my side. I
>>>>> just
>>>>>>> don't
>>>>>>>>>> know how
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> debug it. A clear disconnect I see is the
>>> individual
>>>>>>> request
>>>>>>>>>> latency
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>> reported by metrics on the region server (IPC
>>>>>>>>> processCallTime
>>>>>>>>>> vs
>>>>>>>>>>>>>> scanNext)
>>>>>>>>>>>>>>>>>> vs what's measured on the client. Does this
>>> sound
>>>>>>> right? Any
>>>>>>>>>> ideas
>>>>>>>>>>>>> on
>>>>>>>>>>>>>> how
>>>>>>>>>>>>>>>>>> to better debug it?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> About this trick with the timestamps to be able
>>> to
>>>>> do a
>>>>>>>>>> forward
>>>>>>>>>>>>> scan,
>>>>>>>>>>>>>>>>>> thanks for pointing it out. Actually, I am
>>> aware of
>>>>> it.
>>>>>>> The
>>>>>>>>>> problem
>>>>>>>>>>>>> I
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> is, sometimes I want to get the key after a
>>>>> particular
>>>>>>>>>> timestamp and
>>>>>>>>>>>>>>>>>> sometimes I want to get the key before, so just
>>>>> relying
>>>>>>> on
>>>>>>>>>> the key
>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>> doesn't work. Ideally, I want a reverse get(). I
>>>>> thought
>>>>>>>>>> reverse
>>>>>>>>>>>>> scan
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> do the trick though.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Khaled
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>>>>>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>>>>>>>>>>>>>>>> From: larsh@apache.org
>>>>>>>>>>>>>>>>>>> Subject: Re: HBase read performance
>>>>>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Khaled,
>>>>>>>>>>>>>>>>>>> is it the same with fewer threads? 1500 handler
>>>>> threads
>>>>>>>>>> seems to
>>>>>>>>>>>>> be a
>>>>>>>>>>>>>>>>>> lot. Typically a good number of threads depends
>>> on
>>>>> the
>>>>>>>>>> hardware
>>>>>>>>>>>>>> (number of
>>>>>>>>>>>>>>>>>> cores, number of spindles, etc). I cannot think
>>> of
>>>>> any
>>>>>>> type
>>>>>>>>> of
>>>>>>>>>>>>>> scenario
>>>>>>>>>>>>>>>>>> where more than 100 would give any improvement.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> How large is the payload per KV retrieved that
>>>>> way? If
>>>>>>>>> large
>>>>>>>>>> (as
>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>> few 100k) you definitely want to lower the
>>> number
>>>>> of the
>>>>>>>>>> handler
>>>>>>>>>>>>>> threads.
>>>>>>>>>>>>>>>>>>> How much heap do you give the region server?
>>> Does
>>>>> the
>>>>>>>>>> working set
>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>> into the cache? (i.e. in the metrics, do you
>>> see the
>>>>>>>>> eviction
>>>>>>>>>> count
>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>> up, if so it does not fit into the cache).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> If the working set does not fit into the cache
>>>>>>> (eviction
>>>>>>>>>> count goes
>>>>>>>>>>>>>> up)
>>>>>>>>>>>>>>>>>> then HBase will need to bring a new block in
>>> from
>>>>> disk
>>>>>>> on
>>>>>>>>>> each Get
>>>>>>>>>>>>>>>>>> (assuming the Gets are more or less random as
>>> far
>>>>> as the
>>>>>>>>>> server is
>>>>>>>>>>>>>>>>>> concerned).
>>>>>>>>>>>>>>>>>>> In case you'll benefit from reducing the HFile
>>>>> block
>>>>>>> size
>>>>>>>>>> (from 64k
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> 8k or even 4k).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Lastly I don't think we tested the performance
>>> of
>>>>> using
>>>>>>>>>> reverse
>>>>>>>>>>>>> scan
>>>>>>>>>>>>>>>>>> this way, there is probably room to optimize
>>> this.
>>>>>>>>>>>>>>>>>>> Can you restructure your keys to allow forwards
>>>>>>> scanning?
>>>>>>>>> For
>>>>>>>>>>>>> example
>>>>>>>>>>>>>>>>>> you could store the time as MAX_LONG-time. Or
>>> you
>>>>> could
>>>>>>>>>> invert all
>>>>>>>>>>>>>> the bits
>>>>>>>>>>>>>>>>>> of the time portion of the key, so that it sort
>>> the
>>>>>>> other
>>>>>>>>>> way. Then
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>> could do a forward scan.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Let us know how it goes.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- Lars
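
A small sketch of the key inversion Lars describes (the row key layout is made up; the point is the Long.MAX_VALUE - timestamp trick, which makes newer entries sort first so a plain forward scan can be used):

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeys {
      // Builds a row key of the form <entity><inverted-ts>, so that within an
      // entity the most recent timestamp sorts first and a forward scan started
      // at rowKey(entity, ts) returns the newest entry at or before ts.
      static byte[] rowKey(String entity, long timestampMillis) {
        byte[] prefix = Bytes.toBytes(entity);
        byte[] invertedTs = Bytes.toBytes(Long.MAX_VALUE - timestampMillis);
        return Bytes.add(prefix, invertedTs);
      }
    }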
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>>>>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>>>>>>>>>>>>>>>> To: "user@hbase.apache.org" <
>>> user@hbase.apache.org
>>>>>> 
>>>>>>>>>>>>>>>>>>> Cc:
>>>>>>>>>>>>>>>>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>>>>>>>>>>>>>>>>> Subject: HBase read performance
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I am trying to do a scatter/gather on hbase
>>>>> (0.98.6.1),
>>>>>>>>>> where I
>>>>>>>>>>>>> have
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> client reading ~1000 keys from an HBase table.
>>> These
>>>>>>> keys
>>>>>>>>>> happen to
>>>>>>>>>>>>>> fall on
>>>>>>>>>>>>>>>>>> the same region server. For my reads I use
>>> reverse
>>>>> scan
>>>>>>> to
>>>>>>>>>> read each
>>>>>>>>>>>>>> key as
>>>>>>>>>>>>>>>>>> I want the key prior to a specific time stamp
>>> (time
>>>>>>> stamps
>>>>>>>>> are
>>>>>>>>>>>>> stored
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> reverse order). I don't believe gets can
>>> accomplish
>>>>>>> that,
>>>>>>>>>> right? so
>>>>>>>>>>>>> I
>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>> scan, with caching set to 1.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I use 2000 reader threads in the client and on
>>>>> HBase,
>>>>>>> I've
>>>>>>>>>> set
>>>>>>>>>>>>>>>>>> hbase.regionserver.handler.count to 1500. With
>>> this
>>>>>>> setup,
>>>>>>>>> my
>>>>>>>>>>>>> scatter
>>>>>>>>>>>>>>>>>> gather is very slow and can take up to 10s in
>>> total.
>>>>>>> Timing
>>>>>>>>> an
>>>>>>>>>>>>>> individual
>>>>>>>>>>>>>>>>>> getScanner(..) call on the client side, it can
>>>>> easily
>>>>>>> take
>>>>>>>>> few
>>>>>>>>>>>>>> hundreds of
>>>>>>>>>>>>>>>>>> ms. I also got the following metrics from the
>>> region
>>>>>>> server
>>>>>>>>> in
>>>>>>>>>>>>>> question:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> "queueCallTime_mean" : 2.190855525775637,
>>>>>>>>>>>>>>>>>>> "queueCallTime_median" : 0.0,
>>>>>>>>>>>>>>>>>>> "queueCallTime_75th_percentile" : 0.0,
>>>>>>>>>>>>>>>>>>> "queueCallTime_95th_percentile" : 1.0,
>>>>>>>>>>>>>>>>>>> "queueCallTime_99th_percentile" :
>>>>> 556.9799999999818,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> "processCallTime_min" : 0,
>>>>>>>>>>>>>>>>>>> "processCallTime_max" : 12755,
>>>>>>>>>>>>>>>>>>> "processCallTime_mean" : 105.64873440912682,
>>>>>>>>>>>>>>>>>>> "processCallTime_median" : 0.0,
>>>>>>>>>>>>>>>>>>> "processCallTime_75th_percentile" : 2.0,
>>>>>>>>>>>>>>>>>>> "processCallTime_95th_percentile" : 7917.95,
>>>>>>>>>>>>>>>>>>> "processCallTime_99th_percentile" : 8876.89,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>>>>>>>>>>>>>>>>> : 89,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>>>>>>>>>>>>>>>>> : 11300,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>>>>>>>>>>>>>>>>> : 654.4949739797315,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>>>>>>>>>>>>>>>>> : 101.0,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>>>>>>>>>>>>>>>>> : 101.0,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>>>>>>>>>>>>>>>>> : 101.0,
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>>>>>>>>>>>>>>>>> : 113.0,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Where "delta" is the name of the table I am
>>>>> querying.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In addition to all this, i monitored the
>>> hardware
>>>>>>> resources
>>>>>>>>>> (CPU,
>>>>>>>>>>>>>> disk,
>>>>>>>>>>>>>>>>>> and network) of both the client and the region
>>>>> server
>>>>>>> and
>>>>>>>>>> nothing
>>>>>>>>>>>>>> seems
>>>>>>>>>>>>>>>>>> anywhere near saturation. So I am puzzled by
>>> what's
>>>>>>> going on
>>>>>>>>>> and
>>>>>>>>>>>>>> where this
>>>>>>>>>>>>>>>>>> time is going.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Few things to note based on the above
>>> measurements:
>>>>>>> both
>>>>>>>>>> medians of
>>>>>>>>>>>>>> IPC
>>>>>>>>>>>>>>>>>> processCallTime and queueCallTime are basically
>>> zero
>>>>>>> (ms I
>>>>>>>>>> presume,
>>>>>>>>>>>>>>>>>> right?). However, scanNext_median is 101 (ms
>>> too,
>>>>>>> right?). I
>>>>>>>>>> am not
>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>> how this adds up. Also, even though the 101
>>> figure
>>>>> seems
>>>>>>>>>>>>> outrageously
>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>>>> and I don't know why, still all these scans
>>> should
>>>>> be
>>>>>>>>>> happening in
>>>>>>>>>>>>>>>>>> parallel, so the overall call should finish
>>> fast,
>>>>> given
>>>>>>> that
>>>>>>>>>> no
>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>>>>> resource is contended, right? but this is not
>>> what's
>>>>>>>>>> happening, so I
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> to be missing something(s).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> So, any help is appreciated there.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Khaled
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best regards,
>>>>>> 
>>>>>>   - Andy
>>>>>> 
>>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>> Hein
>>>>>> (via Tom White)
>                         

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
It goes away, but I think that's because I am basically churning the cache far less. My workload is mostly writes.

> Date: Sun, 12 Oct 2014 19:43:16 -0700
> Subject: Re: HBase read performance
> From: yuzhihong@gmail.com
> To: user@hbase.apache.org
> 
> Hi,
> Can you turn off cacheonwrite and keep bucket cache to see if the problem
> goes away ?
> 
> Cheers
> 
> > > > tcpnodelay)?
> > > > > > It
> > > > > > > > > disabled
> > > > > > > > > > > > > by default in HBase, but necessarily in your install
> > of
> > > > HDFS.
> > > > > > > > > > > > > > - Which version of HDFS are you using as backing
> > > > > > filesystem?
> > > > > > > > > > > > > > - If your disk is idle, it means the data fits into
> > > > the OS
> > > > > > > > buffer
> > > > > > > > > > > > cache.
> > > > > > > > > > > > > In turn that means that you increase the heap for the
> > > > region
> > > > > > > > > servers. You
> > > > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to
> > make
> > > > sure
> > > > > > the
> > > > > > > > > entire
> > > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - Also try to reduce the block size - although if
> > your
> > > > > > overall
> > > > > > > > > working
> > > > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is a good section of the book to read through
> > > > > > generally
> > > > > > > > > (even
> > > > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > > >
> > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > > To: "user@hbase.apache.org" <user@hbase.apache.org
> > >
> > > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I do see a very brief spike in CPU (user/system),
> > but
> > > > it's
> > > > > > no
> > > > > > > > > where
> > > > > > > > > > > > near
> > > > > > > > > > > > > 0% idle. It goes from 99% idle down to something
> > like 40%
> > > > > > idle
> > > > > > > > for
> > > > > > > > > a
> > > > > > > > > > > > second
> > > > > > > > > > > > > or so. The thing to note, this is all on a test
> > cluster,
> > > > so
> > > > > > no
> > > > > > > > > real load.
> > > > > > > > > > > > > Things are generally idle until i issue 2-3 of these
> > > > > > > > > multi-scan-requests
> > > > > > > > > > > > to
> > > > > > > > > > > > > render a web page. Then, you see the spike in the
> > cpu and
> > > > > > some
> > > > > > > > > activity
> > > > > > > > > > > > in
> > > > > > > > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If there are specific tests you'd like me to do to
> > > > debug
> > > > > > this,
> > > > > > > > > I'd be
> > > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Khaled
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> I still think you're waiting on disk. No IOWAIT?
> > So
> > > > CPU
> > > > > > is not
> > > > > > > > > waiting
> > > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> If you see a lot of evicted block then each RPC
> > has a
> > > > high
> > > > > > > > > chance of
> > > > > > > > > > > > > requiring to bring an entire 64k block in. You'll
> > see bad
> > > > > > > > > performance
> > > > > > > > > > > > with
> > > > > > > > > > > > > this.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> > user@hbase.apache.org>
> > > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > > > logging. No
> > > > > > > > long
> > > > > > > > > pauses
> > > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Other than the discrepancy between what's
> > reported on
> > > > the
> > > > > > > > > client and
> > > > > > > > > > > > > what's reported at the RS, there is also the issue
> > of not
> > > > > > getting
> > > > > > > > > proper
> > > > > > > > > > > > > concurrency. So, even if a reverse get takes 100ms
> > or so
> > > > > > (this
> > > > > > > > has
> > > > > > > > > to be
> > > > > > > > > > > > > mostly blocking on various things as no physical
> > > > resource is
> > > > > > > > > contended),
> > > > > > > > > > > > > then the other gets/scans should be able to proceed
> > in
> > > > > > parallel,
> > > > > > > > > so a
> > > > > > > > > > > > > thousand concurrent gets/scans should finish in few
> > > > hundreds
> > > > > > of
> > > > > > > > ms
> > > > > > > > > not
> > > > > > > > > > > > many
> > > > > > > > > > > > > seconds. That's why I thought I'd increase the
> > handlers
> > > > > > count to
> > > > > > > > > try to
> > > > > > > > > > > > get
> > > > > > > > > > > > > more concurrency, but it didn't help. So, there must
> > be
> > > > > > something
> > > > > > > > > else.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> Do check again on the heap size of the region
> > > > servers.
> > > > > > The
> > > > > > > > > default
> > > > > > > > > > > > > >>> unconfigured size is 1G; too small for much of
> > > > anything.
> > > > > > > > Check
> > > > > > > > > your
> > > > > > > > > > > > RS
> > > > > > > > > > > > > logs
> > > > > > > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor
> > > > thread.
> > > > > > > > They
> > > > > > > > > > > > usually
> > > > > > > > > > > > > >>> correlate with long GC pauses or other
> > process-freeze
> > > > > > events.
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> Get is implemented as a Scan of a single row, so
> > a
> > > > > > reverse
> > > > > > > > > scan of a
> > > > > > > > > > > > > single
> > > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> In practice, I have seen discrepancy between the
> > > > > > latencies
> > > > > > > > > reported
> > > > > > > > > > > > by
> > > > > > > > > > > > > the
> > > > > > > > > > > > > >>> RS and the latencies experienced by the client.
> > I've
> > > > not
> > > > > > > > > investigated
> > > > > > > > > > > > > this
> > > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> > Elmeleegy <
> > > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> Yes performance is similar with less handlers (I
> > > > tried
> > > > > > with
> > > > > > > > > 100
> > > > > > > > > > > > > first).
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The working
> > set
> > > > > > doesn't
> > > > > > > > > seem to
> > > > > > > > > > > > > fit in
> > > > > > > > > > > > > >>>> memory as there are many cache misses. However,
> > > > disk is
> > > > > > far
> > > > > > > > > from
> > > > > > > > > > > > > being a
> > > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also
> > verified
> > > > that
> > > > > > > > > neither the
> > > > > > > > > > > > > >>>> network nor the CPU of the region server or the
> > > > client
> > > > > > are a
> > > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > > >>>> This leads me to believe that likely this is a
> > > > software
> > > > > > > > > bottleneck,
> > > > > > > > > > > > > >>>> possibly due to a misconfiguration on my side. I
> > > > just
> > > > > > don't
> > > > > > > > > know how
> > > > > > > > > > > > > to
> > > > > > > > > > > > > >>>> debug it. A clear disconnect I see is the
> > individual
> > > > > > request
> > > > > > > > > latency
> > > > > > > > > > > > > as
> > > > > > > > > > > > > >>>> reported by metrics on the region server (IPC
> > > > > > > > processCallTime
> > > > > > > > > vs
> > > > > > > > > > > > > scanNext)
> > > > > > > > > > > > > >>>> vs what's measured on the client. Does this
> > sound
> > > > > > right? Any
> > > > > > > > > ideas
> > > > > > > > > > > > on
> > > > > > > > > > > > > how
> > > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> About this trick with the timestamps to be able
> > to
> > > > do a
> > > > > > > > > forward
> > > > > > > > > > > > scan,
> > > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am
> > aware of
> > > > it.
> > > > > > The
> > > > > > > > > problem
> > > > > > > > > > > > I
> > > > > > > > > > > > > have
> > > > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > > > particular
> > > > > > > > > timestamp and
> > > > > > > > > > > > > >>>> sometimes I want to get the key before, so just
> > > > relying
> > > > > > on
> > > > > > > > > the key
> > > > > > > > > > > > > order
> > > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I
> > > > thought
> > > > > > > > > reverse
> > > > > > > > > > > > scan
> > > > > > > > > > > > > can
> > > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500 handler
> > > > threads
> > > > > > > > > seems to
> > > > > > > > > > > > be a
> > > > > > > > > > > > > >>>> lot. Typically a good number of threads depends
> > on
> > > > the
> > > > > > > > > hardware
> > > > > > > > > > > > > (number of
> > > > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot think
> > of
> > > > any
> > > > > > type
> > > > > > > > of
> > > > > > > > > > > > > scenario
> > > > > > > > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> How large is the payload per KV retrieved that
> > > > way? If
> > > > > > > > large
> > > > > > > > > (as
> > > > > > > > > > > > in a
> > > > > > > > > > > > > >>>> few 100k) you definitely want to lower the
> > number
> > > > of the
> > > > > > > > > handler
> > > > > > > > > > > > > threads.
> > > > > > > > > > > > > >>>>> How much heap do you give the region server?
> > Does
> > > > the
> > > > > > > > > working set
> > > > > > > > > > > > fit
> > > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you
> > see the
> > > > > > > > eviction
> > > > > > > > > count
> > > > > > > > > > > > > going
> > > > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> If the working set does not fit into the cache
> > > > > > (eviction
> > > > > > > > > count goes
> > > > > > > > > > > > > up)
> > > > > > > > > > > > > >>>> then HBase will need to bring a new block in
> > from
> > > > disk
> > > > > > on
> > > > > > > > > each Get
> > > > > > > > > > > > > >>>> (assuming the Gets are more or less random as
> > far
> > > > as the
> > > > > > > > > server is
> > > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > > >>>>> In case you'll benefit from reducing the HFile
> > > > block
> > > > > > size
> > > > > > > > > (from 64k
> > > > > > > > > > > > > to
> > > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Lastly I don't think we tested the performance
> > of
> > > > using
> > > > > > > > > reverse
> > > > > > > > > > > > scan
> > > > > > > > > > > > > >>>> this way, there is probably room to optimize
> > this.
> > > > > > > > > > > > > >>>>> Can you restructure your keys to allow forwards
> > > > > > scanning?
> > > > > > > > For
> > > > > > > > > > > > example
> > > > > > > > > > > > > >>>> you could store the time as MAX_LONG-time. Or
> > you
> > > > could
> > > > > > > > > invert all
> > > > > > > > > > > > > the bits
> > > > > > > > > > > > > >>>> of the time portion of the key, so that it sort
> > the
> > > > > > other
> > > > > > > > > way. Then
> > > > > > > > > > > > > you
> > > > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <
> > user@hbase.apache.org
> > > > >
> > > > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> > > > (0.98.6.1),
> > > > > > > > > where I
> > > > > > > > > > > > have
> > > > > > > > > > > > > a
> > > > > > > > > > > > > >>>> client reading ~1000 keys from an HBase table.
> > These
> > > > > > keys
> > > > > > > > > happen to
> > > > > > > > > > > > > fall on
> > > > > > > > > > > > > >>>> the same region server. For my reads I use
> > reverse
> > > > scan
> > > > > > to
> > > > > > > > > read each
> > > > > > > > > > > > > key as
> > > > > > > > > > > > > >>>> I want the key prior to a specific time stamp
> > (time
> > > > > > stamps
> > > > > > > > are
> > > > > > > > > > > > stored
> > > > > > > > > > > > > in
> > > > > > > > > > > > > >>>> reverse order). I don't believe gets can
> > accomplish
> > > > > > that,
> > > > > > > > > right? so
> > > > > > > > > > > > I
> > > > > > > > > > > > > use
> > > > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> I use 2000 reader threads in the client and on
> > > > HBase,
> > > > > > I've
> > > > > > > > > set
> > > > > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With
> > this
> > > > > > setup,
> > > > > > > > my
> > > > > > > > > > > > scatter
> > > > > > > > > > > > > >>>> gather is very slow and can take up to 10s in
> > total.
> > > > > > Timing
> > > > > > > > an
> > > > > > > > > > > > > individual
> > > > > > > > > > > > > >>>> getScanner(..) call on the client side, it can
> > > > easily
> > > > > > take
> > > > > > > > few
> > > > > > > > > > > > > hundreds of
> > > > > > > > > > > > > >>>> ms. I also got the following metrics from the
> > region
> > > > > > server
> > > > > > > > in
> > > > > > > > > > > > > question:
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> > > > 556.9799999999818,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > > > > > > >>>> : 89,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > > > > > > >>>> : 11300,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > > > > > > >>>> : 113.0,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> > > > querying.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> In addition to all this, i monitored the
> > hardware
> > > > > > resources
> > > > > > > > > (CPU,
> > > > > > > > > > > > > disk,
> > > > > > > > > > > > > >>>> and network) of both the client and the region
> > > > server
> > > > > > and
> > > > > > > > > nothing
> > > > > > > > > > > > > seems
> > > > > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by
> > what's
> > > > > > going on
> > > > > > > > > and
> > > > > > > > > > > > > where this
> > > > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Few things to note based on the above
> > measurements:
> > > > > > both
> > > > > > > > > medians of
> > > > > > > > > > > > > IPC
> > > > > > > > > > > > > >>>> processCallTime and queueCallTime are basically
> > zero
> > > > > > (ms I
> > > > > > > > > presume,
> > > > > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms
> > too,
> > > > > > right?). I
> > > > > > > > > am not
> > > > > > > > > > > > > sure
> > > > > > > > > > > > > >>>> how this adds up. Also, even though the 101
> > figure
> > > > seems
> > > > > > > > > > > > outrageously
> > > > > > > > > > > > > high
> > > > > > > > > > > > > >>>> and I don't know why, still all these scans
> > should
> > > > be
> > > > > > > > > happening in
> > > > > > > > > > > > > >>>> parallel, so the overall call should finish
> > fast,
> > > > given
> > > > > > that
> > > > > > > > > no
> > > > > > > > > > > > > hardware
> > > > > > > > > > > > > >>>> resource is contended, right? but this is not
> > what's
> > > > > > > > > happening, so I
> > > > > > > > > > > > > have
> > > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > > (via Tom White)
> > > >
> > > >
> >
> >

Re: HBase read performance

Posted by Ted Yu <yu...@gmail.com>.
Hi,
Can you turn off cacheonwrite and keep bucket cache to see if the problem
goes away ?

Cheers
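
A minimal sketch of that change, assuming the settings quoted below: only the
three cacheonwrite flags are flipped, the bucket cache entries stay as they are.

    hbase-site:
    hbase.bucketcache.ioengine=offheap
    hbase.bucketcache.size=4196
    hbase.rs.cacheblocksonwrite=false
    hfile.block.index.cacheonwrite=false
    hfile.block.bloom.cacheonwrite=false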

On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
wrote:

> Yes, I can reproduce it with some work.
> The workload is basically as follows:
> There are writers streaming writes to a table. Then, there is a reader
> (invoked via a web interface). The reader does 1000 parallel reverse
> scans, which all end up hitting the same region in my case. The scans are
> effectively "gets" as I just need one record from each of them. What I
> really need is a "reverse" get, which is not supported (would be great to
> have :)), so I do it via reverse scan. After a few tries, the reader
> consistently hits this bug.
>
> This happens with these config changes:
> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> hbase-site:
> hbase.bucketcache.ioengine=offheap
> hbase.bucketcache.size=4196
> hbase.rs.cacheblocksonwrite=true
> hfile.block.index.cacheonwrite=true
> hfile.block.bloom.cacheonwrite=true
>
> Interestingly, without these config changes, I can't reproduce the problem.
> Khaled
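
For reference, a minimal sketch of the "reverse get" described above, written
against the 0.98 client API; the helper method is a placeholder, not code taken
from this thread:

    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // Return the first row at or before startKey, i.e. a "reverse get".
    static Result reverseGet(HTableInterface table, byte[] startKey) throws Exception {
        Scan scan = new Scan(startKey);
        scan.setReversed(true);   // walk backwards starting from startKey
        scan.setCaching(1);       // only one row is ever needed
        scan.setSmall(true);      // one RPC instead of open/next/close (see Scan#setSmall below)
        ResultScanner scanner = table.getScanner(scan);
        try {
            return scanner.next(); // null if nothing sorts at or before startKey
        } finally {
            scanner.close();
        }
    }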
>
>
> > Date: Fri, 10 Oct 2014 10:05:14 -0700
> > Subject: Re: HBase read performance
> > From: stack@duboce.net
> > To: user@hbase.apache.org
> >
> > It looks like we are messing up our positioning math:
> >
> > java.nio.Buffer#position(int), lines 233-235 of
> > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java
> >
> >     public final Buffer position(int newPosition) {
> >         if ((newPosition > limit) || (newPosition < 0))
> >             throw new IllegalArgumentException();
> >
> >
> > Is it easy to reproduce Khaled?  Always same region/store or spread
> > across all reads?
> >
> >
> > St.Ack
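
As a tiny illustration of the failure mode being pointed at (plain java.nio, not
HBase code): Buffer.position() rejects any offset beyond the buffer's limit,
which is where the IllegalArgumentException in the trace quoted further down
comes from once the computed skip overshoots the block.

    import java.nio.ByteBuffer;

    public class PositionBeyondLimit {
        public static void main(String[] args) {
            ByteBuffer block = ByteBuffer.allocate(64 * 1024);
            block.limit(100);      // pretend only 100 bytes of the block are valid
            block.position(150);   // 150 > limit(100): throws IllegalArgumentException
        }
    }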
> >
> >
> >
> > On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> >
> > > Andrew thanks. I think this indeed was the problem.
> > > To get over it, I increased the amount of memory given to the region
> > > server to avoid IO on reads. I used the configs below. In my experiments,
> > > I have writers streaming their output to HBase. The reader powers a web
> > > page and does this scatter/gather, where it reads the 1000 most recently
> > > written keys and passes them to the front end. With this workload, I get
> > > the exception below at the region server. Again, I am using HBase
> > > (0.98.6.1). Any help is appreciated.
> > > hbase-env:
> > > HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
> > > hbase-site:
> > > hbase.bucketcache.ioengine=offheap
> > > hfile.block.cache.size=0.4
> > > hbase.bucketcache.size=4196
> > > hbase.rs.cacheblocksonwrite=true
> > > hfile.block.index.cacheonwrite=true
> > > hfile.block.bloom.cacheonwrite=true
> > >
> > > 2014-10-10 15:06:44,173 ERROR
> > > [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer:
> > > Unexpected throwable object
> > > java.lang.IllegalArgumentException
> > >         at java.nio.Buffer.position(Buffer.java:236)
> > >         at
> > >
> org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> > >         at
> > >
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> > >         at
> > >
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> > >         at
> > >
> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> > >         at
> > >
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> > >         at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> > >         at
> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> > >         at
> > >
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> > >         at
> > > org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> > >         at java.lang.Thread.run(Thread.java:744)
> > > > From: apurtell@apache.org
> > > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > > Subject: Re: HBase read performance
> > > > To: user@hbase.apache.org
> > > >
> > > > > The cluster has 2 m1.large nodes.
> > > >
> > > > That's the problem right there.
> > > >
> > > > You need to look at c3.4xlarge or i2 instances as a minimum
> requirement.
> > > M1
> > > > and even M3 instance types have ridiculously poor IO.
> > > >
> > > >
> > > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com>
> > > wrote:
> > > >
> > > > > Thanks Nicolas, Qiang.
> > > > >
> > > > > I was able to write a simple program that reproduces the problem on a
> > > > > tiny HBase cluster on ec2. The cluster has 2 m1.large nodes. One node
> > > > > runs the master, name node and zookeeper. The other node runs a data
> > > > > node and a region server, with heap size configured to be 6GB. There,
> > > > > the 1000 parallel reverse gets (reverse scans) take 7-8 seconds. The
> > > > > data set is tiny (10M records, each having a small number of bytes).
> > > > > As I said before, all hardware resources are very idle there.
> > > > >
> > > > > Interestingly, running the same workload on my macbook, the 1000
> > > > > parallel gets take ~200ms on a pseudo-distributed installation.
> > > > >
> > > > > Any help to resolve this mystery is highly appreciated.
> > > > >
> > > > > P.S. please find my test program attached.
> > > > >
> > > > > Best,
> > > > > Khaled
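
The test program mentioned in the P.S. is not preserved in the archive; a rough
sketch of what such a scatter/gather driver can look like is below (the thread
count and the placeholder row keys are assumptions, and the per-key read mirrors
the reverseGet sketch earlier in the thread):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScatterGather {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            final HConnection conn = HConnectionManager.createConnection(conf);
            ExecutorService pool = Executors.newFixedThreadPool(200);
            List<Future<Result>> futures = new ArrayList<Future<Result>>();

            long start = System.currentTimeMillis();
            for (int i = 0; i < 1000; i++) {
                final byte[] key = Bytes.toBytes("row-" + i);   // placeholder keys
                futures.add(pool.submit(new Callable<Result>() {
                    public Result call() throws Exception {
                        // HTable handles are not thread safe; take a cheap per-task handle
                        HTableInterface table = conn.getTable("delta");
                        try {
                            Scan scan = new Scan(key);
                            scan.setReversed(true);  // "reverse get": first row at or before key
                            scan.setCaching(1);
                            ResultScanner scanner = table.getScanner(scan);
                            try {
                                return scanner.next();
                            } finally {
                                scanner.close();
                            }
                        } finally {
                            table.close();
                        }
                    }
                }));
            }
            for (Future<Result> f : futures) {
                f.get();                             // wait for the whole scatter/gather
            }
            System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
            pool.shutdown();
            conn.close();
        }
    }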
> > > > >
> > > > > > From: nkeywal@gmail.com
> > > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > > Subject: Re: HBase read performance
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I haven't seen it mentioned, but if I understand correctly each
> scan
> > > > > > returns a single row? If so you should use Scan#setSmall to save
> > > some rpc
> > > > > > calls.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Nicolas
> > > > > >
> > > > > >
> > > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > when using a separate HConnection instance, both its RpcClient
> > > > > > > instance (which maintains connections to the regionservers) and its
> > > > > > > Registry instance (which maintains the connection to zookeeper) will
> > > > > > > be separate.
> > > > > > >
> > > > > > > see
> > > > > > >
> > > > > > >
> > > > >
> > >
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
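
That post covers the ZooKeeper server-side cap on concurrent connections from a
single client IP. For reference, the knob is maxClientCnxns in zoo.cfg (the value
below is only an illustration); when ZooKeeper is managed by HBase, the same
setting is exposed as hbase.zookeeper.property.maxClientCnxns in hbase-site.xml.

    # zoo.cfg -- max concurrent connections from one client IP (0 = unlimited)
    maxClientCnxns=300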
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > > kdiaa@hotmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I tried creating my own HConnections pool to use for my HBase calls,
> > > > > > > > so that not all the (2K) threads share the same HConnection. However,
> > > > > > > > I could only have 10 HConnections. Beyond that I get ZK exceptions,
> > > > > > > > please find it below. Also, with 10 HConnections, I don't see
> > > > > > > > noticeable improvement in performance so far.
> > > > > > > > 2014-10-05 06:11:26,490 WARN [main]
> > > zookeeper.RecoverableZooKeeper
> > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > transient
> > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > >
> > > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > > 06:11:26,490
> > > > > > > > INFO [main] util.RetryCounter
> > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845
> WARN
> > > > > [main]
> > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > transient
> > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > >
> > > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > > 06:11:27,849
> > > > > > > > INFO [main] util.RetryCounter
> > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405
> WARN
> > > > > [main]
> > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > transient
> > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > >
> > > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > > 06:11:30,405
> > > > > > > > INFO [main] util.RetryCounter
> > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278
> WARN
> > > > > [main]
> > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > transient
> > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > >
> > > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > > 06:11:35,279
> > > > > > > > INFO [main] util.RetryCounter
> > > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393
> WARN
> > > > > [main]
> > > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > > transient
> > > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > > >
> > > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > > 06:11:44,393
> > > > > > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper
> exists
> > > > > failed
> > > > > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main]
> > > zookeeper.ZKUtil
> > > > > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b,
> quorum=
> > > > > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher
> on
> > > znode
> > > > > > > >
> > > > > > >
> > > > >
> > >
> (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > > > > >
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > > at
> > > > > > > >
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > at
> > > > > > > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > > at
> > > > >
> org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > > at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > > > Method) at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > > at
> > > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > > at
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
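
For what it's worth, a rough sketch of the connection pool being described (pool
size and table name are placeholders): each HConnection created this way opens
its own RPC client and its own ZooKeeper session, which is why growing the pool
can run into the per-IP ZooKeeper connection cap mentioned above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;

    public class ConnectionPool {
        private final HConnection[] connections;

        ConnectionPool(int size) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            connections = new HConnection[size];
            for (int i = 0; i < size; i++) {
                // each createConnection() gets its own RPC client and ZooKeeper session
                connections[i] = HConnectionManager.createConnection(conf);
            }
        }

        // hand a table handle to a worker, spreading the load over the connections
        HTableInterface tableFor(int workerId, String tableName) throws Exception {
            return connections[workerId % connections.length].getTable(tableName);
        }
    }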
> > > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > > >
> > > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > > Thanks for all the input.
> > > > > > > > > Qiang: yes all the threads are in the same client process sharing
> > > > > > > > > the same connection. And since I don't see hardware contention,
> > > > > > > > > maybe there is contention over this code path. I'll try using many
> > > > > > > > > connections and see if it alleviates the problems and I'll report
> > > > > > > > > back.
> > > > > > > > > Thanks again, Khaled
> > > > > > > > >
> > > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > >
> > > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > > >
> > > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > > months
> > > > > > > > > > ago.
> > > > > > > > > >
> > > > > > > > > > processCallTime comes from RpcServer#call, so it looks
> good?
> > > > > > > > > >
> > > > > > > > > > I have a suspect:
> > > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > > >
> > > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > > if olny 1 process, those threads will share just 1
> > > connection to
> > > > > that
> > > > > > > > > > regionserver, there might be big contention on the RPC
> code
> > > path.
> > > > > > > > ---for
> > > > > > > > > > such case, could you try using different connections?
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <
> yuzhihong@gmail.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Khaled:
> > > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > > >
> > > > > > > > > > > Otherwise capturing stack trace during the period of
> > > reverse
> > > > > scan
> > > > > > > > would
> > > > > > > > > > > help.
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > > larsh@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > You might have the data in the OS buffer cache,
> without
> > > short
> > > > > > > > circuit
> > > > > > > > > > > > reading the region server has to request the block
> from
> > > the
> > > > > data
> > > > > > > > node
> > > > > > > > > > > > process, which then reads it from the block cache.
> > > > > > > > > > > > That is a few context switches per RPC that do not
> show
> > > up
> > > > > in CPU
> > > > > > > > > > > metrics.
> > > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > If - as you say - you see a lot of evicted blocks the
> > > data
> > > > > *has*
> > > > > > > > to come
> > > > > > > > > > > > from the OS. If you do not see disk IO is *has* to
> come
> > > from
> > > > > the
> > > > > > > OS
> > > > > > > > > > > cache.
> > > > > > > > > > > > I.e. there's more RAM on your boxes, and you should
> > > increase
> > > > > the
> > > > > > > > heap
> > > > > > > > > > > block
> > > > > > > > > > > > cache.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > You can measure the context switches with vmstat.
> Other
> > > than
> > > > > that
> > > > > > > > I have
> > > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > > Also check the data locality index of the region
> server
> > > it
> > > > > should
> > > > > > > > be
> > > > > > > > > > > close
> > > > > > > > > > > > to 100%.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > -- Lars
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ________________________________
> > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Lars thanks a lot about all the tips. I'll make sure
> I
> > > cover
> > > > > all
> > > > > > > > of them
> > > > > > > > > > > > and get back to you. I am not sure they are the
> > > bottleneck
> > > > > though
> > > > > > > > as they
> > > > > > > > > > > > all are about optimizing physical resource usage. As
> I
> > > said,
> > > > > I
> > > > > > > > don't see
> > > > > > > > > > > > any contended physical resources now. I'll also try
> to
> > > > > reproduce
> > > > > > > > this
> > > > > > > > > > > > problem in a simpler environment and pass to you the
> test
> > > > > program
> > > > > > > > to play
> > > > > > > > > > > > with.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Couple of high level points to make. You are right
> that
> > > my
> > > > > use
> > > > > > > > case is
> > > > > > > > > > > > kind of a worst case for HBase reads. But, if things
> go
> > > the
> > > > > way
> > > > > > > you
> > > > > > > > > > > > described them, there should be tons of disk IO and
> that
> > > > > should
> > > > > > > be
> > > > > > > > > > > clearly
> > > > > > > > > > > > the bottleneck. This is not the case though. That's
> for
> > > the
> > > > > > > simple
> > > > > > > > reason
> > > > > > > > > > > > that this is done in a test environment (I am still
> > > > > prototyping),
> > > > > > > > and
> > > > > > > > > > > not a
> > > > > > > > > > > > lot of data is yet written to HBase. However for the
> > > real use
> > > > > > > > case, there
> > > > > > > > > > > > should writers constantly writing data to HBase and
> > > readers
> > > > > > > > occasionally
> > > > > > > > > > > > doing this scatter/gather. At steady state, things
> should
> > > > > only
> > > > > > > get
> > > > > > > > worse
> > > > > > > > > > > > and all the issues you mentioned should get far more
> > > > > pronounced.
> > > > > > > > At this
> > > > > > > > > > > > point, one can try to mitigate it using more memory
> or
> > > so. I
> > > > > am
> > > > > > > > not there
> > > > > > > > > > > > yet as I think I am hitting some software bottleneck,
> > > which I
> > > > > > > > don't know
> > > > > > > > > > > > how to work around.
> > > > > > > > > > > >
> > > > > > > > > > > > Khaled
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > > >
> > > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > > Any chance that you can provide a minimal test
> program
> > > and
> > > > > > > > instruction
> > > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One thing to note is that with scanning HBase
> cannot
> > > use
> > > > > bloom
> > > > > > > > filters
> > > > > > > > > > > > to rule out HFiles ahead of time, it needs to look
> into
> > > all
> > > > > of
> > > > > > > > them.
> > > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > > - Do you have short circuit reading enabled? If
> not,
> > > you
> > > > > > > should.
> > > > > > > > > > > > > - Is your table major compacted? That will reduce
> the
> > > > > number of
> > > > > > > > files
> > > > > > > > > > > to
> > > > > > > > > > > > look at.
> > > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> > > tcpnodelay)?
> > > > > It
> > > > > > > > disabled
> > > > > > > > > > > > by default in HBase, but necessarily in your install
> of
> > > HDFS.
> > > > > > > > > > > > > - Which version of HDFS are you using as backing
> > > > > filesystem?
> > > > > > > > > > > > > - If your disk is idle, it means the data fits into
> > > the OS
> > > > > > > buffer
> > > > > > > > > > > cache.
> > > > > > > > > > > > In turn that means that you increase the heap for the
> > > region
> > > > > > > > servers. You
> > > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to
> make
> > > sure
> > > > > the
> > > > > > > > entire
> > > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Also try to reduce the block size - although if
> your
> > > > > overall
> > > > > > > > working
> > > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is a good section of the book to read through
> > > > > generally
> > > > > > > > (even
> > > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > >
> http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > -- Lars
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > > To: "user@hbase.apache.org" <user@hbase.apache.org
> >
> > > > > > > > > > > > > Cc:
> > > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > > >
> > > > > > > > > > > > > I do see a very brief spike in CPU (user/system),
> but
> > > it's
> > > > > no
> > > > > > > > where
> > > > > > > > > > > near
> > > > > > > > > > > > 0% idle. It goes from 99% idle down to something
> like 40%
> > > > > idle
> > > > > > > for
> > > > > > > > a
> > > > > > > > > > > second
> > > > > > > > > > > > or so. The thing to note, this is all on a test
> cluster,
> > > so
> > > > > no
> > > > > > > > real load.
> > > > > > > > > > > > Things are generally idle until i issue 2-3 of these
> > > > > > > > multi-scan-requests
> > > > > > > > > > > to
> > > > > > > > > > > > render a web page. Then, you see the spike in the
> cpu and
> > > > > some
> > > > > > > > activity
> > > > > > > > > > > in
> > > > > > > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > If there are specific tests you'd like me to do to
> > > debug
> > > > > this,
> > > > > > > > I'd be
> > > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Khaled
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> I still think you're waiting on disk. No IOWAIT?
> So
> > > CPU
> > > > > is not
> > > > > > > > waiting
> > > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> If you see a lot of evicted block then each RPC
> has a
> > > high
> > > > > > > > chance of
> > > > > > > > > > > > requiring to bring an entire 64k block in. You'll
> see bad
> > > > > > > > performance
> > > > > > > > > > > with
> > > > > > > > > > > > this.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> -- Lars
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> ________________________________
> > > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > >> To: "user@hbase.apache.org" <
> user@hbase.apache.org>
> > > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > > logging. No
> > > > > > > long
> > > > > > > > pauses
> > > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Other than the discrepancy between what's
> reported on
> > > the
> > > > > > > > client and
> > > > > > > > > > > > what's reported at the RS, there is also the issue
> of not
> > > > > getting
> > > > > > > > proper
> > > > > > > > > > > > concurrency. So, even if a reverse get takes 100ms
> or so
> > > > > (this
> > > > > > > has
> > > > > > > > to be
> > > > > > > > > > > > mostly blocking on various things as no physical
> > > resource is
> > > > > > > > contended),
> > > > > > > > > > > > then the other gets/scans should be able to proceed
> in
> > > > > parallel,
> > > > > > > > so a
> > > > > > > > > > > > thousand concurrent gets/scans should finish in few
> > > hundreds
> > > > > of
> > > > > > > ms
> > > > > > > > not
> > > > > > > > > > > many
> > > > > > > > > > > > seconds. That's why I thought I'd increase the
> handlers
> > > > > count to
> > > > > > > > try to
> > > > > > > > > > > get
> > > > > > > > > > > > more concurrency, but it didn't help. So, there must
> be
> > > > > something
> > > > > > > > else.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Khaled
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> Do check again on the heap size of the region
> > > servers.
> > > > > The
> > > > > > > > default
> > > > > > > > > > > > >>> unconfigured size is 1G; too small for much of
> > > anything.
> > > > > > > Check
> > > > > > > > your
> > > > > > > > > > > RS
> > > > > > > > > > > > logs
> > > > > > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor
> > > thread.
> > > > > > > They
> > > > > > > > > > > usually
> > > > > > > > > > > > >>> correlate with long GC pauses or other
> process-freeze
> > > > > events.
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> Get is implemented as a Scan of a single row, so
> a
> > > > > reverse
> > > > > > > > scan of a
> > > > > > > > > > > > single
> > > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> In practice, I have seen discrepancy between the
> > > > > latencies
> > > > > > > > reported
> > > > > > > > > > > by
> > > > > > > > > > > > the
> > > > > > > > > > > > >>> RS and the latencies experienced by the client.
> I've
> > > not
> > > > > > > > investigated
> > > > > > > > > > > > this
> > > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled
> Elmeleegy <
> > > > > > > > kdiaa@hotmail.com
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> Yes performance is similar with less handlers (I
> > > tried
> > > > > with
> > > > > > > > 100
> > > > > > > > > > > > first).
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> The payload is not big ~1KB or so. The working
> set
> > > > > doesn't
> > > > > > > > seem to
> > > > > > > > > > > > fit in
> > > > > > > > > > > > >>>> memory as there are many cache misses. However,
> > > disk is
> > > > > far
> > > > > > > > from
> > > > > > > > > > > > being a
> > > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also
> verified
> > > that
> > > > > > > > neither the
> > > > > > > > > > > > >>>> network nor the CPU of the region server or the
> > > client
> > > > > are a
> > > > > > > > > > > > bottleneck.
> > > > > > > > > > > > >>>> This leads me to believe that likely this is a
> > > software
> > > > > > > > bottleneck,
> > > > > > > > > > > > >>>> possibly due to a misconfiguration on my side. I
> > > just
> > > > > don't
> > > > > > > > know how
> > > > > > > > > > > > to
> > > > > > > > > > > > >>>> debug it. A clear disconnect I see is the
> individual
> > > > > request
> > > > > > > > latency
> > > > > > > > > > > > as
> > > > > > > > > > > > >>>> reported by metrics on the region server (IPC
> > > > > > > processCallTime
> > > > > > > > vs
> > > > > > > > > > > > scanNext)
> > > > > > > > > > > > >>>> vs what's measured on the client. Does this
> sound
> > > > > right? Any
> > > > > > > > ideas
> > > > > > > > > > > on
> > > > > > > > > > > > how
> > > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> About this trick with the timestamps to be able
> to
> > > do a
> > > > > > > > forward
> > > > > > > > > > > scan,
> > > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am
> aware of
> > > it.
> > > > > The
> > > > > > > > problem
> > > > > > > > > > > I
> > > > > > > > > > > > have
> > > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > > particular
> > > > > > > > timestamp and
> > > > > > > > > > > > >>>> sometimes I want to get the key before, so just
> > > relying
> > > > > on
> > > > > > > > the key
> > > > > > > > > > > > order
> > > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I
> > > thought
> > > > > > > > reverse
> > > > > > > > > > > scan
> > > > > > > > > > > > can
> > > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > > >>>>> is it the same with fewer threads? 1500 handler
> > > threads
> > > > > > > > seems to
> > > > > > > > > > > be a
> > > > > > > > > > > > >>>> lot. Typically a good number of threads depends
> on
> > > the
> > > > > > > > hardware
> > > > > > > > > > > > (number of
> > > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot think
> of
> > > any
> > > > > type
> > > > > > > of
> > > > > > > > > > > > scenario
> > > > > > > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> How large is the payload per KV retrieved that
> > > way? If
> > > > > > > large
> > > > > > > > (as
> > > > > > > > > > > in a
> > > > > > > > > > > > >>>> few 100k) you definitely want to lower the
> number
> > > of the
> > > > > > > > handler
> > > > > > > > > > > > threads.
> > > > > > > > > > > > >>>>> How much heap do you give the region server?
> Does
> > > the
> > > > > > > > working set
> > > > > > > > > > > fit
> > > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you
> see the
> > > > > > > eviction
> > > > > > > > count
> > > > > > > > > > > > going
> > > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> If the working set does not fit into the cache
> > > > > (eviction
> > > > > > > > count goes
> > > > > > > > > > > > up)
> > > > > > > > > > > > >>>> then HBase will need to bring a new block in
> from
> > > disk
> > > > > on
> > > > > > > > each Get
> > > > > > > > > > > > >>>> (assuming the Gets are more or less random as
> far
> > > as the
> > > > > > > > server is
> > > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > > >>>>> In case you'll benefit from reducing the HFile
> > > block
> > > > > size
> > > > > > > > (from 64k
> > > > > > > > > > > > to
> > > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Lastly I don't think we tested the performance
> of
> > > using
> > > > > > > > reverse
> > > > > > > > > > > scan
> > > > > > > > > > > > >>>> this way, there is probably room to optimize
> this.
> > > > > > > > > > > > >>>>> Can you restructure your keys to allow forwards
> > > > > scanning?
> > > > > > > For
> > > > > > > > > > > example
> > > > > > > > > > > > >>>> you could store the time as MAX_LONG-time. Or
> you
> > > could
> > > > > > > > invert all
> > > > > > > > > > > > the bits
> > > > > > > > > > > > >>>> of the time portion of the key, so that it sort
> the
> > > > > other
> > > > > > > > way. Then
> > > > > > > > > > > > you
> > > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <
> user@hbase.apache.org
> > > >
> > > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> > > (0.98.6.1),
> > > > > > > > where I
> > > > > > > > > > > have
> > > > > > > > > > > > a
> > > > > > > > > > > > >>>> client reading ~1000 keys from an HBase table.
> These
> > > > > keys
> > > > > > > > happen to
> > > > > > > > > > > > fall on
> > > > > > > > > > > > >>>> the same region server. For my reads I use
> reverse
> > > scan
> > > > > to
> > > > > > > > read each
> > > > > > > > > > > > key as
> > > > > > > > > > > > >>>> I want the key prior to a specific time stamp
> (time
> > > > > stamps
> > > > > > > are
> > > > > > > > > > > stored
> > > > > > > > > > > > in
> > > > > > > > > > > > >>>> reverse order). I don't believe gets can
> accomplish
> > > > > that,
> > > > > > > > right? so
> > > > > > > > > > > I
> > > > > > > > > > > > use
> > > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> I use 2000 reader threads in the client and on
> > > HBase,
> > > > > I've
> > > > > > > > set
> > > > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With
> this
> > > > > setup,
> > > > > > > my
> > > > > > > > > > > scatter
> > > > > > > > > > > > >>>> gather is very slow and can take up to 10s in
> total.
> > > > > Timing
> > > > > > > an
> > > > > > > > > > > > individual
> > > > > > > > > > > > >>>> getScanner(..) call on the client side, it can
> > > easily
> > > > > take
> > > > > > > few
> > > > > > > > > > > > hundreds of
> > > > > > > > > > > > >>>> ms. I also got the following metrics from the
> region
> > > > > server
> > > > > > > in
> > > > > > > > > > > > question:
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> > > 556.9799999999818,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > > > > > >>>> : 89,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > > > > > >>>> : 11300,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > > > > > >>>> : 113.0,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> > > querying.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> In addition to all this, i monitored the
> hardware
> > > > > resources
> > > > > > > > (CPU,
> > > > > > > > > > > > disk,
> > > > > > > > > > > > >>>> and network) of both the client and the region
> > > server
> > > > > and
> > > > > > > > nothing
> > > > > > > > > > > > seems
> > > > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by
> what's
> > > > > going on
> > > > > > > > and
> > > > > > > > > > > > where this
> > > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Few things to note based on the above
> measurements:
> > > > > both
> > > > > > > > medians of
> > > > > > > > > > > > IPC
> > > > > > > > > > > > >>>> processCallTime and queueCallTime are basically
> zero
> > > > > (ms I
> > > > > > > > presume,
> > > > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms
> too,
> > > > > right?). I
> > > > > > > > am not
> > > > > > > > > > > > sure
> > > > > > > > > > > > >>>> how this adds up. Also, even though the 101
> figure
> > > seems
> > > > > > > > > > > outrageously
> > > > > > > > > > > > high
> > > > > > > > > > > > >>>> and I don't know why, still all these scans
> should
> > > be
> > > > > > > > happening in
> > > > > > > > > > > > >>>> parallel, so the overall call should finish
> fast,
> > > given
> > > > > that
> > > > > > > > no
> > > > > > > > > > > > hardware
> > > > > > > > > > > > >>>> resource is contended, right? but this is not
> what's
> > > > > > > > happening, so I
> > > > > > > > > > > > have
> > > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > > (via Tom White)
> > >
> > >
>
>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Nick,

I am still prototyping, hence the small instance. I thought -XX:MaxDirectMemorySize was a subset of -Xmx, so the total would be 6GB. You are saying they are added together. If so, that's not good for performance, as the JVM will swap. However, that can't be the reason for the bug.

For now, I guess the only option is to use the LruBlockCache. However, I expect it to suffer from significant GC, since I churn memory quickly: I cache blocks on write, and I do a lot of writing.
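If I do fall back to the on-heap cache, something along these lines is roughly what I'd try on the m1.large -- illustrative numbers only, sized so the region server heap plus the DataNode still fit in the 7.5GB of RAM:

hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx4G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
hbase-site: hfile.block.cache.size=0.4
            (hbase.bucketcache.ioengine and hbase.bucketcache.size removed, so no -XX:MaxDirectMemorySize is needed)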

Khaled

----------------------------------------
> From: ndimiduk@gmail.com
> Date: Fri, 10 Oct 2014 15:00:45 -0700
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> Hang on, how are you using 11G total memory? m1.large only has 7.5G total
> RAM.
>
> On Fri, Oct 10, 2014 at 2:56 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>
>> ByteBuffer position math errors makes me suspect #1 cacheonwrite and #2
>> bucketcache (and #3 their use in combination ;) )
>>
>> 11G memory is a small enough footprint that I'd not bother with
>> BucketCache; just stay on heap with default LruBlockCache.
>>
>> On Fri, Oct 10, 2014 at 2:11 PM, Stack <st...@duboce.net> wrote:
>>
>>> On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
>>> wrote:
>>>
>>>> Yes, I can reproduce it with some work.
>>>> The workload is basically as follows:
>>>> There are writers streaming writes to a table. Then, there is a reader
>>>> (invoked via a web interface). The reader does a 1000 parallel reverse
>>>> scans, all end up hitting the same region in my case. The scans are
>>>> effectively "gets" as I just need to get one record off of each of
>>> them. I
>>>> just need to do a "reverse" get, which is not supported (would be great
>>> to
>>>> have :)), so I do it via reverse scan. After few tries, the reader
>>>> consistently hits this bug.
>>>>
>>>> This happens with these config changes:
>>>> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
>>>> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog
>>>> gc:/tmp/hbase-regionserver-gc.loghbase-site:
>>>> hbase.bucketcache.ioengine=offheap
>>>> hbase.bucketcache.size=4196
>>>> hbase.rs.cacheblocksonwrite=true
>>>> hfile.block.index.cacheonwrite=true
>>>> hfile.block.bloom.cacheonwrite=true
>>>>
>>>> Interestingly, without these config changes, I can't reproduce the
>>> problem.
>>>
>>>
>>>
>>> How hard to play w/ combinations? Could you eliminate the cacheonwrites
>>> on
>>> one server and see if that cures the issue? Could trun off block cache on
>>> another to see if that the problem? Anything in your .out files related?
>>>
>>> St.Ack
>>>
>>
>>
 		 	   		  

Re: HBase read performance

Posted by Nick Dimiduk <nd...@gmail.com>.
Hang on, how are you using 11G total memory? m1.large only has 7.5G total
RAM.

On Fri, Oct 10, 2014 at 2:56 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> ByteBuffer position math errors makes me suspect #1 cacheonwrite and #2
> bucketcache (and #3 their use in combination ;) )
>
> 11G memory is a small enough footprint that I'd not bother with
> BucketCache; just stay on heap with default LruBlockCache.
>
> On Fri, Oct 10, 2014 at 2:11 PM, Stack <st...@duboce.net> wrote:
>
>> On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
>> wrote:
>>
>> > Yes, I can reproduce it with some work.
>> > The workload is basically as follows:
>> > There are writers streaming writes to a table. Then, there is a reader
>> > (invoked via a web interface). The reader does a 1000 parallel reverse
>> > scans, all end up hitting the same region in my case. The scans are
>> > effectively "gets" as I just need to get one record off of each of
>> them. I
>> > just need to do a "reverse" get, which is not supported (would be great
>> to
>> > have :)), so I do it via reverse scan. After few tries, the reader
>> > consistently hits this bug.
>> >
>> > This happens with these config changes:
>> > hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
>> > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
>> > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog
>> > gc:/tmp/hbase-regionserver-gc.loghbase-site:
>> > hbase.bucketcache.ioengine=offheap
>> > hbase.bucketcache.size=4196
>> > hbase.rs.cacheblocksonwrite=true
>> > hfile.block.index.cacheonwrite=true
>> > hfile.block.bloom.cacheonwrite=true
>> >
>> > Interestingly, without these config changes, I can't reproduce the
>> problem.
>>
>>
>>
>> How hard to play w/ combinations?  Could you eliminate the cacheonwrites
>> on
>> one server and see if that cures the issue?  Could trun off block cache on
>> another to see if that the problem?  Anything in your .out files related?
>>
>> St.Ack
>>
>
>

Re: HBase read performance

Posted by Nick Dimiduk <nd...@gmail.com>.
ByteBuffer position math errors make me suspect #1 cacheonwrite and #2
bucketcache (and #3 their use in combination ;) )

11G memory is a small enough footprint that I'd not bother with
BucketCache; just stay on heap with default LruBlockCache.
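To bisect, something like the following on two different region servers should tell us which suspect it is (same keys as earlier in this thread; just a sketch, untested):

RS A (rule out cacheonwrite): keep hbase.bucketcache.* as-is, but set
  hbase.rs.cacheblocksonwrite=false
  hfile.block.index.cacheonwrite=false
  hfile.block.bloom.cacheonwrite=false

RS B (rule out the BucketCache): keep the cacheonwrite flags, but remove
  hbase.bucketcache.ioengine=offheap
  hbase.bucketcache.size=4196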

On Fri, Oct 10, 2014 at 2:11 PM, Stack <st...@duboce.net> wrote:

> On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
>
> > Yes, I can reproduce it with some work.
> > The workload is basically as follows:
> > There are writers streaming writes to a table. Then, there is a reader
> > (invoked via a web interface). The reader does a 1000 parallel reverse
> > scans, all end up hitting the same region in my case. The scans are
> > effectively "gets" as I just need to get one record off of each of them.
> I
> > just need to do a "reverse" get, which is not supported (would be great
> to
> > have :)), so I do it via reverse scan. After few tries, the reader
> > consistently hits this bug.
> >
> > This happens with these config changes:
> > hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog
> > gc:/tmp/hbase-regionserver-gc.loghbase-site:
> > hbase.bucketcache.ioengine=offheap
> > hbase.bucketcache.size=4196
> > hbase.rs.cacheblocksonwrite=true
> > hfile.block.index.cacheonwrite=true
> > hfile.block.bloom.cacheonwrite=true
> >
> > Interestingly, without these config changes, I can't reproduce the
> problem.
>
>
>
> How hard to play w/ combinations?  Could you eliminate the cacheonwrites on
> one server and see if that cures the issue?  Could trun off block cache on
> another to see if that the problem?  Anything in your .out files related?
>
> St.Ack
>

Re: HBase read performance

Posted by Stack <st...@duboce.net>.
On Fri, Oct 10, 2014 at 10:59 AM, Khaled Elmeleegy <kd...@hotmail.com>
wrote:

> Yes, I can reproduce it with some work.
> The workload is basically as follows:
> There are writers streaming writes to a table. Then, there is a reader
> (invoked via a web interface). The reader does a 1000 parallel reverse
> scans, all end up hitting the same region in my case. The scans are
> effectively "gets" as I just need to get one record off of each of them. I
> just need to do a "reverse" get, which is not supported (would be great to
> have :)), so I do it via reverse scan. After few tries, the reader
> consistently hits this bug.
>
> This happens with these config changes:
> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog
> gc:/tmp/hbase-regionserver-gc.loghbase-site:
> hbase.bucketcache.ioengine=offheap
> hbase.bucketcache.size=4196
> hbase.rs.cacheblocksonwrite=true
> hfile.block.index.cacheonwrite=true
> hfile.block.bloom.cacheonwrite=true
>
> Interestingly, without these config changes, I can't reproduce the problem.



How hard would it be to play w/ combinations?  Could you eliminate the cacheonwrites on
one server and see if that cures the issue?  Could you turn off the block cache on
another to see if that's the problem?  Anything related in your .out files?

St.Ack

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Yes, I can reproduce it with some work. 
The workload is basically as follows:
There are writers streaming writes to a table. Then, there is a reader (invoked via a web interface). The reader does 1000 parallel reverse scans, which in my case all end up hitting the same region. The scans are effectively "gets", as I just need one record from each of them. What I really need is a "reverse" get, which is not supported (would be great to have :)), so I do it via a reverse scan. After a few tries, the reader consistently hits this bug.

This happens with these config changes:
hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
hbase-site:
hbase.bucketcache.ioengine=offheap
hbase.bucketcache.size=4196
hbase.rs.cacheblocksonwrite=true
hfile.block.index.cacheonwrite=true
hfile.block.bloom.cacheonwrite=true

Interestingly, without these config changes, I can't reproduce the problem.
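In case it helps to see it concretely, each of the reader's lookups is essentially the sketch below (simplified, not my exact code; the setSmall() call is Nicolas's earlier suggestion and I haven't verified it combines with reversed scans on 0.98.6.1). The reader then fires ~1000 of these in parallel from its thread pool.

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ReverseGetSketch {
    // Returns the newest row at or before startRow, or null if there is none.
    static Result reverseGet(HTableInterface table, byte[] startRow) throws IOException {
        Scan scan = new Scan(startRow);   // start here and walk backwards
        scan.setReversed(true);
        scan.setCaching(1);               // only one row is wanted
        scan.setSmall(true);              // single-RPC small scan (Nicolas's suggestion)
        ResultScanner scanner = table.getScanner(scan);
        try {
            return scanner.next();        // first row of the reverse scan == the "reverse get"
        } finally {
            scanner.close();
        }
    }
}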
Khaled


> Date: Fri, 10 Oct 2014 10:05:14 -0700
> Subject: Re: HBase read performance
> From: stack@duboce.net
> To: user@hbase.apache.org
> 
> It looks like we are messing up our positioning math:
> 
>     // java.nio.Buffer (OpenJDK), lines 233-235:
>     public final Buffer position(int newPosition) {
>         if ((newPosition > limit) || (newPosition < 0))
>             throw new IllegalArgumentException();
> 
> 
> Is it easy to reproduce Khaled?  Always same region/store or spread
> across all reads?
> 
> 
> St.Ack
> 
> 
> 
> On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> 
> > Andrew thanks. I think this indeed was the problem.
> > To get over it, I increased the amount of memory given the region server
> > to avoid IO on reads. I used the configs below. In my experiments, I have
> > writers streaming their output to HBase. The reader powers a web page and
> > does this scatter/gather, where it reads 1000 keys written last and passes
> > them the the front end. With this workload, I get the exception below at
> > the region server. Again, I am using HBAse (0.98.6.1). Any help is
> > appreciated.
> > hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> > -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog
> > gc:/tmp/hbase-regionserver-gc.log
> > hbase-site: hbase.bucketcache.ioengine=offheap
> >             hfile.block.cache.size=0.4
> >             hbase.bucketcache.size=4196
> >             hbase.rs.cacheblocksonwrite=true
> >             hfile.block.index.cacheonwrite=true
> >             hfile.block.bloom.cacheonwrite=true
> >
> > 2014-10-10 15:06:44,173 ERROR
> > [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer:
> > Unexpected throwable object
> > java.lang.IllegalArgumentException
> >         at java.nio.Buffer.position(Buffer.java:236)
> >         at
> > org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
> >         at
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
> >         at
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
> >         at
> > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
> >         at
> > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> >         at
> > org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> >         at
> > org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
> >         at
> > org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
> >         at
> > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
> >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
> >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> >         at
> > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> >         at
> > org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> >         at java.lang.Thread.run(Thread.java:744)
> > > From: apurtell@apache.org
> > > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > > Subject: Re: HBase read performance
> > > To: user@hbase.apache.org
> > >
> > > > The cluster has 2 m1.large nodes.
> > >
> > > That's the problem right there.
> > >
> > > You need to look at c3.4xlarge or i2 instances as a minimum requirement.
> > M1
> > > and even M3 instance types have ridiculously poor IO.
> > >
> > >
> > > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com>
> > wrote:
> > >
> > > > Thanks Nicolas, Qiang.
> > > >
> > > > I was able to write a simple program that reproduces the problem on a
> > tiny
> > > > HBase cluster on ec2. The cluster has 2 m1.large nodes. One node runs
> > the
> > > > master, name node and zookeeper. The other node runs a data node and a
> > > > region server, with heap size configured to be 6GB. There, the 1000
> > > > parallel reverse gets (reverse scans) take 7-8 seconds. The data set is
> > > > tiny (10M records, each having a small number of bytes). As I said
> > before,
> > > > all hardware resources are very idle there.
> > > >
> > > > Interestingly, running the same workload on my macbook, the 1000
> > parallel
> > > > gets take ~200ms on a pseudo-distributed installation.
> > > >
> > > > Any help to resolve this mystery is highly appreciated.
> > > >
> > > > P.S. please find my test program attached.
> > > >
> > > > Best,
> > > > Khaled
> > > >
> > > > > From: nkeywal@gmail.com
> > > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > > Subject: Re: HBase read performance
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > Hi,
> > > > >
> > > > > I haven't seen it mentioned, but if I understand correctly each scan
> > > > > returns a single row? If so you should use Scan#setSmall to save
> > some rpc
> > > > > calls.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Nicolas
> > > > >
> > > > >
> > > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com>
> > wrote:
> > > > >
> > > > > > when using separate HConnection instance, both its
> > > > > > RpcClient instance(maintain connection to a regionserver) and
> > Registry
> > > > > > instance(maintain connection to zookeeper) will be separate..
> > > > > >
> > > > > > see
> > > > > >
> > > > > >
> > > >
> > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> > kdiaa@hotmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I tried creating my own HConnections pool to use for my HBase
> > calls,
> > > > so
> > > > > > > that not all the (2K) threads share the same HConnection.
> > However, I
> > > > > > could
> > > > > > > only have 10 HConnections. Beyond that I get ZK exceptions,
> > please
> > > > find
> > > > > > it
> > > > > > > below. Also, with 10 HConnections, I don't see noticeable
> > > > improvement in
> > > > > > > performance so far.
> > > > > > > 2014-10-05 06:11:26,490 WARN [main]
> > zookeeper.RecoverableZooKeeper
> > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > transient
> > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > 06:11:26,490
> > > > > > > INFO [main] util.RetryCounter
> > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN
> > > > [main]
> > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > transient
> > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > 06:11:27,849
> > > > > > > INFO [main] util.RetryCounter
> > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN
> > > > [main]
> > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > transient
> > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > 06:11:30,405
> > > > > > > INFO [main] util.RetryCounter
> > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN
> > > > [main]
> > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > transient
> > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > 06:11:35,279
> > > > > > > INFO [main] util.RetryCounter
> > > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN
> > > > [main]
> > > > > > > zookeeper.RecoverableZooKeeper
> > > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> > transient
> > > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > > >
> > > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > > 06:11:44,393
> > > > > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists
> > > > failed
> > > > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main]
> > zookeeper.ZKUtil
> > > > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> > > > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on
> > znode
> > > > > > >
> > > > > >
> > > >
> > (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > > > >
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > > at
> > > > > > >
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > at
> > > > > > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > > at
> > > > org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > > at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > > at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > > at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > > at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > > Method) at
> > > > > > >
> > > > > >
> > > >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > > at
> > > > > > >
> > > > > >
> > > >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > > at
> > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > > at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > > at
> > > > > > >
> > > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > > > > > From: kdiaa@hotmail.com
> > > > > > > > To: user@hbase.apache.org
> > > > > > > > Subject: RE: HBase read performance
> > > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > > >
> > > > > > > > Lars, Ted, and Qiang,
> > > > > > > > Thanks for all the input.
> > > > > > > > Qiang: yes all the threads are in the same client process
> > sharing
> > > > the
> > > > > > > same connection. And since I don't see hardware contention, may
> > be
> > > > there
> > > > > > is
> > > > > > > contention over this code path. I'll try using many connections
> > and
> > > > see
> > > > > > if
> > > > > > > it alleviates the problems and I'll report back.
> > > > > > > > Thanks again,Khaled
> > > > > > > >
> > > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > From: tianq01@gmail.com
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > >
> > > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > > >
> > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > > months
> > > > > > > > > ago.
> > > > > > > > >
> > > > > > > > > processCallTime comes from RpcServer#call, so it looks good?
> > > > > > > > >
> > > > > > > > > I have a suspect:
> > > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > > >
> > > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > > if olny 1 process, those threads will share just 1
> > connection to
> > > > that
> > > > > > > > > regionserver, there might be big contention on the RPC code
> > path.
> > > > > > > ---for
> > > > > > > > > such case, could you try using different connections?
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > > Khaled:
> > > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > > Profiling would give us more hint.
> > > > > > > > > >
> > > > > > > > > > Otherwise capturing stack trace during the period of
> > reverse
> > > > scan
> > > > > > > would
> > > > > > > > > > help.
> > > > > > > > > >
> > > > > > > > > > Cheers
> > > > > > > > > >
> > > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > > larsh@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > You might have the data in the OS buffer cache, without
> > short
> > > > > > > circuit
> > > > > > > > > > > reading the region server has to request the block from
> > the
> > > > data
> > > > > > > node
> > > > > > > > > > > process, which then reads it from the block cache.
> > > > > > > > > > > That is a few context switches per RPC that do not show
> > up
> > > > in CPU
> > > > > > > > > > metrics.
> > > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > If - as you say - you see a lot of evicted blocks the
> > data
> > > > *has*
> > > > > > > to come
> > > > > > > > > > > from the OS. If you do not see disk IO is *has* to come
> > from
> > > > the
> > > > > > OS
> > > > > > > > > > cache.
> > > > > > > > > > > I.e. there's more RAM on your boxes, and you should
> > increase
> > > > the
> > > > > > > heap
> > > > > > > > > > block
> > > > > > > > > > > cache.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > You can measure the context switches with vmstat. Other
> > than
> > > > that
> > > > > > > I have
> > > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > > Also check the data locality index of the region server
> > it
> > > > should
> > > > > > > be
> > > > > > > > > > close
> > > > > > > > > > > to 100%.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > -- Lars
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ________________________________
> > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Lars thanks a lot about all the tips. I'll make sure I
> > cover
> > > > all
> > > > > > > of them
> > > > > > > > > > > and get back to you. I am not sure they are the
> > bottleneck
> > > > though
> > > > > > > as they
> > > > > > > > > > > all are about optimizing physical resource usage. As I
> > said,
> > > > I
> > > > > > > don't see
> > > > > > > > > > > any contended physical resources now. I'll also try to
> > > > reproduce
> > > > > > > this
> > > > > > > > > > > problem in a simpler environment and pass to you the test
> > > > program
> > > > > > > to play
> > > > > > > > > > > with.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Couple of high level points to make. You are right that
> > my
> > > > use
> > > > > > > case is
> > > > > > > > > > > kind of a worst case for HBase reads. But, if things go
> > the
> > > > way
> > > > > > you
> > > > > > > > > > > described them, there should be tons of disk IO and that
> > > > should
> > > > > > be
> > > > > > > > > > clearly
> > > > > > > > > > > the bottleneck. This is not the case though. That's for
> > the
> > > > > > simple
> > > > > > > reason
> > > > > > > > > > > that this is done in a test environment (I am still
> > > > prototyping),
> > > > > > > and
> > > > > > > > > > not a
> > > > > > > > > > > lot of data is yet written to HBase. However for the
> > real use
> > > > > > > case, there
> > > > > > > > > > > should writers constantly writing data to HBase and
> > readers
> > > > > > > occasionally
> > > > > > > > > > > doing this scatter/gather. At steady state, things should
> > > > only
> > > > > > get
> > > > > > > worse
> > > > > > > > > > > and all the issues you mentioned should get far more
> > > > pronounced.
> > > > > > > At this
> > > > > > > > > > > point, one can try to mitigate it using more memory or
> > so. I
> > > > am
> > > > > > > not there
> > > > > > > > > > > yet as I think I am hitting some software bottleneck,
> > which I
> > > > > > > don't know
> > > > > > > > > > > how to work around.
> > > > > > > > > > >
> > > > > > > > > > > Khaled
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > > Any chance that you can provide a minimal test program
> > and
> > > > > > > instruction
> > > > > > > > > > > about how to set it up.
> > > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > > >
> > > > > > > > > > > > One thing to note is that with scanning HBase cannot
> > use
> > > > bloom
> > > > > > > filters
> > > > > > > > > > > to rule out HFiles ahead of time, it needs to look into
> > all
> > > > of
> > > > > > > them.
> > > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > > - Do you have short circuit reading enabled? If not,
> > you
> > > > > > should.
> > > > > > > > > > > > - Is your table major compacted? That will reduce the
> > > > number of
> > > > > > > files
> > > > > > > > > > to
> > > > > > > > > > > look at.
> > > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> > tcpnodelay)?
> > > > It
> > > > > > > disabled
> > > > > > > > > > > by default in HBase, but necessarily in your install of
> > HDFS.
> > > > > > > > > > > > - Which version of HDFS are you using as backing
> > > > filesystem?
> > > > > > > > > > > > - If your disk is idle, it means the data fits into
> > the OS
> > > > > > buffer
> > > > > > > > > > cache.
> > > > > > > > > > > In turn that means that you increase the heap for the
> > region
> > > > > > > servers. You
> > > > > > > > > > > can also use block encoding (FAST_DIFF) to try to make
> > sure
> > > > the
> > > > > > > entire
> > > > > > > > > > > working set fits into the cache.
> > > > > > > > > > > >
> > > > > > > > > > > > - Also try to reduce the block size - although if your
> > > > overall
> > > > > > > working
> > > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > This is a good section of the book to read through
> > > > generally
> > > > > > > (even
> > > > > > > > > > > though you might know most of this already):
> > > > > > > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > -- Lars
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > > > Cc:
> > > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > > >
> > > > > > > > > > > > I do see a very brief spike in CPU (user/system), but
> > it's
> > > > no
> > > > > > > where
> > > > > > > > > > near
> > > > > > > > > > > 0% idle. It goes from 99% idle down to something like 40%
> > > > idle
> > > > > > for
> > > > > > > a
> > > > > > > > > > second
> > > > > > > > > > > or so. The thing to note, this is all on a test cluster,
> > so
> > > > no
> > > > > > > real load.
> > > > > > > > > > > Things are generally idle until i issue 2-3 of these
> > > > > > > multi-scan-requests
> > > > > > > > > > to
> > > > > > > > > > > render a web page. Then, you see the spike in the cpu and
> > > > some
> > > > > > > activity
> > > > > > > > > > in
> > > > > > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > If there are specific tests you'd like me to do to
> > debug
> > > > this,
> > > > > > > I'd be
> > > > > > > > > > > more than happy to do it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Khaled
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ----------------------------------------
> > > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > > >>
> > > > > > > > > > > >> I still think you're waiting on disk. No IOWAIT? So
> > CPU
> > > > is not
> > > > > > > waiting
> > > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > > >>
> > > > > > > > > > > >> If you see a lot of evicted block then each RPC has a
> > high
> > > > > > > chance of
> > > > > > > > > > > requiring to bring an entire 64k block in. You'll see bad
> > > > > > > performance
> > > > > > > > > > with
> > > > > > > > > > > this.
> > > > > > > > > > > >>
> > > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > > >>
> > > > > > > > > > > >> -- Lars
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> ________________________________
> > > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> > logging. No
> > > > > > long
> > > > > > > pauses
> > > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Other than the discrepancy between what's reported on
> > the
> > > > > > > client and
> > > > > > > > > > > what's reported at the RS, there is also the issue of not
> > > > getting
> > > > > > > proper
> > > > > > > > > > > concurrency. So, even if a reverse get takes 100ms or so
> > > > (this
> > > > > > has
> > > > > > > to be
> > > > > > > > > > > mostly blocking on various things as no physical
> > resource is
> > > > > > > contended),
> > > > > > > > > > > then the other gets/scans should be able to proceed in
> > > > parallel,
> > > > > > > so a
> > > > > > > > > > > thousand concurrent gets/scans should finish in few
> > hundreds
> > > > of
> > > > > > ms
> > > > > > > not
> > > > > > > > > > many
> > > > > > > > > > > seconds. That's why I thought I'd increase the handlers
> > > > count to
> > > > > > > try to
> > > > > > > > > > get
> > > > > > > > > > > more concurrency, but it didn't help. So, there must be
> > > > something
> > > > > > > else.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Khaled
> > > > > > > > > > > >>
> > > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Do check again on the heap size of the region
> > servers.
> > > > The
> > > > > > > default
> > > > > > > > > > > >>> unconfigured size is 1G; too small for much of
> > anything.
> > > > > > Check
> > > > > > > your
> > > > > > > > > > RS
> > > > > > > > > > > logs
> > > > > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor
> > thread.
> > > > > > They
> > > > > > > > > > usually
> > > > > > > > > > > >>> correlate with long GC pauses or other process-freeze
> > > > events.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Get is implemented as a Scan of a single row, so a
> > > > reverse
> > > > > > > scan of a
> > > > > > > > > > > single
> > > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> In practice, I have seen discrepancy between the
> > > > latencies
> > > > > > > reported
> > > > > > > > > > by
> > > > > > > > > > > the
> > > > > > > > > > > >>> RS and the latencies experienced by the client. I've
> > not
> > > > > > > investigated
> > > > > > > > > > > this
> > > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> > > > > > > kdiaa@hotmail.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Yes performance is similar with less handlers (I
> > tried
> > > > with
> > > > > > > 100
> > > > > > > > > > > first).
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> The payload is not big ~1KB or so. The working set
> > > > doesn't
> > > > > > > seem to
> > > > > > > > > > > fit in
> > > > > > > > > > > >>>> memory as there are many cache misses. However,
> > disk is
> > > > far
> > > > > > > from
> > > > > > > > > > > being a
> > > > > > > > > > > >>>> bottleneck. I checked using iostat. I also verified
> > that
> > > > > > > neither the
> > > > > > > > > > > >>>> network nor the CPU of the region server or the
> > client
> > > > are a
> > > > > > > > > > > bottleneck.
> > > > > > > > > > > >>>> This leads me to believe that likely this is a
> > software
> > > > > > > bottleneck,
> > > > > > > > > > > >>>> possibly due to a misconfiguration on my side. I
> > just
> > > > don't
> > > > > > > know how
> > > > > > > > > > > to
> > > > > > > > > > > >>>> debug it. A clear disconnect I see is the individual
> > > > request
> > > > > > > latency
> > > > > > > > > > > as
> > > > > > > > > > > >>>> reported by metrics on the region server (IPC
> > > > > > processCallTime
> > > > > > > vs
> > > > > > > > > > > scanNext)
> > > > > > > > > > > >>>> vs what's measured on the client. Does this sound
> > > > right? Any
> > > > > > > ideas
> > > > > > > > > > on
> > > > > > > > > > > how
> > > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> About this trick with the timestamps to be able to
> > do a
> > > > > > > forward
> > > > > > > > > > scan,
> > > > > > > > > > > >>>> thanks for pointing it out. Actually, I am aware of
> > it.
> > > > The
> > > > > > > problem
> > > > > > > > > > I
> > > > > > > > > > > have
> > > > > > > > > > > >>>> is, sometimes I want to get the key after a
> > particular
> > > > > > > timestamp and
> > > > > > > > > > > >>>> sometimes I want to get the key before, so just
> > relying
> > > > on
> > > > > > > the key
> > > > > > > > > > > order
> > > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I
> > thought
> > > > > > > reverse
> > > > > > > > > > scan
> > > > > > > > > > > can
> > > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Khaled
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > > >>>>> is it the same with fewer threads? 1500 handler
> > threads
> > > > > > > seems to
> > > > > > > > > > be a
> > > > > > > > > > > >>>> lot. Typically a good number of threads depends on
> > the
> > > > > > > hardware
> > > > > > > > > > > (number of
> > > > > > > > > > > >>>> cores, number of spindles, etc). I cannot think of
> > any
> > > > type
> > > > > > of
> > > > > > > > > > > scenario
> > > > > > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> How large is the payload per KV retrieved that
> > way? If
> > > > > > large
> > > > > > > (as
> > > > > > > > > > in a
> > > > > > > > > > > >>>> few 100k) you definitely want to lower the number
> > of the
> > > > > > > handler
> > > > > > > > > > > threads.
> > > > > > > > > > > >>>>> How much heap do you give the region server? Does
> > the
> > > > > > > working set
> > > > > > > > > > fit
> > > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you see the
> > > > > > eviction
> > > > > > > count
> > > > > > > > > > > going
> > > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> If the working set does not fit into the cache
> > > > (eviction
> > > > > > > count goes
> > > > > > > > > > > up)
> > > > > > > > > > > >>>> then HBase will need to bring a new block in from
> > disk
> > > > on
> > > > > > > each Get
> > > > > > > > > > > >>>> (assuming the Gets are more or less random as far
> > as the
> > > > > > > server is
> > > > > > > > > > > >>>> concerned).
> > > > > > > > > > > >>>>> In case you'll benefit from reducing the HFile
> > block
> > > > size
> > > > > > > (from 64k
> > > > > > > > > > > to
> > > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Lastly I don't think we tested the performance of
> > using
> > > > > > > reverse
> > > > > > > > > > scan
> > > > > > > > > > > >>>> this way, there is probably room to optimize this.
> > > > > > > > > > > >>>>> Can you restructure your keys to allow forwards
> > > > scanning?
> > > > > > For
> > > > > > > > > > example
> > > > > > > > > > > >>>> you could store the time as MAX_LONG-time. Or you
> > could
> > > > > > > invert all
> > > > > > > > > > > the bits
> > > > > > > > > > > >>>> of the time portion of the key, so that it sort the
> > > > other
> > > > > > > way. Then
> > > > > > > > > > > you
> > > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > >>>>> To: "user@hbase.apache.org" <user@hbase.apache.org
> > >
> > > > > > > > > > > >>>>> Cc:
> > > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Hi,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> > (0.98.6.1),
> > > > > > > where I
> > > > > > > > > > have
> > > > > > > > > > > a
> > > > > > > > > > > >>>> client reading ~1000 keys from an HBase table. These
> > > > keys
> > > > > > > happen to
> > > > > > > > > > > fall on
> > > > > > > > > > > >>>> the same region server. For my reads I use reverse
> > scan
> > > > to
> > > > > > > read each
> > > > > > > > > > > key as
> > > > > > > > > > > >>>> I want the key prior to a specific time stamp (time
> > > > stamps
> > > > > > are
> > > > > > > > > > stored
> > > > > > > > > > > in
> > > > > > > > > > > >>>> reverse order). I don't believe gets can accomplish
> > > > that,
> > > > > > > right? so
> > > > > > > > > > I
> > > > > > > > > > > use
> > > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> I use 2000 reader threads in the client and on
> > HBase,
> > > > I've
> > > > > > > set
> > > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With this
> > > > setup,
> > > > > > my
> > > > > > > > > > scatter
> > > > > > > > > > > >>>> gather is very slow and can take up to 10s in total.
> > > > Timing
> > > > > > an
> > > > > > > > > > > individual
> > > > > > > > > > > >>>> getScanner(..) call on the client side, it can
> > easily
> > > > take
> > > > > > few
> > > > > > > > > > > hundreds of
> > > > > > > > > > > >>>> ms. I also got the following metrics from the region
> > > > server
> > > > > > in
> > > > > > > > > > > question:
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> > 556.9799999999818,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > > > > >>>> : 89,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > > > > >>>> : 11300,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > > > > >>>> : 113.0,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> > querying.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> In addition to all this, i monitored the hardware
> > > > resources
> > > > > > > (CPU,
> > > > > > > > > > > disk,
> > > > > > > > > > > >>>> and network) of both the client and the region
> > server
> > > > and
> > > > > > > nothing
> > > > > > > > > > > seems
> > > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by what's
> > > > going on
> > > > > > > and
> > > > > > > > > > > where this
> > > > > > > > > > > >>>> time is going.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Few things to note based on the above measurements:
> > > > both
> > > > > > > medians of
> > > > > > > > > > > IPC
> > > > > > > > > > > >>>> processCallTime and queueCallTime are basically zero
> > > > (ms I
> > > > > > > presume,
> > > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms too,
> > > > right?). I
> > > > > > > am not
> > > > > > > > > > > sure
> > > > > > > > > > > >>>> how this adds up. Also, even though the 101 figure
> > seems
> > > > > > > > > > outrageously
> > > > > > > > > > > high
> > > > > > > > > > > >>>> and I don't know why, still all these scans should
> > be
> > > > > > > happening in
> > > > > > > > > > > >>>> parallel, so the overall call should finish fast,
> > given
> > > > that
> > > > > > > no
> > > > > > > > > > > hardware
> > > > > > > > > > > >>>> resource is contended, right? but this is not what's
> > > > > > > happening, so I
> > > > > > > > > > > have
> > > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > > >>>>> Khaled
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > > (via Tom White)
> >
> >

Re: HBase read performance

Posted by Stack <st...@duboce.net>.
It looks like we are messing up our positioning math:

    public final Buffer position(int newPosition) {
        if ((newPosition > limit) || (newPosition < 0))
            throw new IllegalArgumentException();

(java.nio.Buffer, OpenJDK 6-b14, lines 233-235:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/Buffer.java#233)


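For illustration only, a minimal standalone sketch (plain java.nio, not HBase code; the sizes are made up) of the check being tripped. ByteBufferUtils.skip() in the trace below advances the block buffer's position by the key/value length it just read, and Buffer.position() throws as soon as that lands past the buffer's limit:

    import java.nio.ByteBuffer;

    public class PositionOverrun {
        public static void main(String[] args) {
            ByteBuffer block = ByteBuffer.allocate(64);
            block.limit(16);          // pretend only 16 bytes of the block are valid
            block.position(8);        // fine: within the limit
            block.position(8 + 40);   // IllegalArgumentException: 48 > limit (16)
        }
    }
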
Is it easy to reproduce, Khaled? Is it always the same region/store, or is it
spread across all reads?


St.Ack



On Fri, Oct 10, 2014 at 8:31 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:

> Andrew thanks. I think this indeed was the problem.
> To get over it, I increased the amount of memory given the region server
> to avoid IO on reads. I used the configs below. In my experiments, I have
> writers streaming their output to HBase. The reader powers a web page and
> does this scatter/gather, where it reads 1000 keys written last and passes
> them the the front end. With this workload, I get the exception below at
> the region server. Again, I am using HBAse (0.98.6.1). Any help is
> appreciated.
> hbase-env:HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G
> -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xlog
> gc:/tmp/hbase-regionserver-gc.log
> hbase-site: hbase.bucketcache.ioengine=offheap
>             hfile.block.cache.size=0.4
>             hbase.bucketcache.size=4196
>             hbase.rs.cacheblocksonwrite=true
>             hfile.block.index.cacheonwrite=true
>             hfile.block.bloom.cacheonwrite=true
>
> 2014-10-10 15:06:44,173 ERROR
> [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer:
> Unexpected throwable object
> java.lang.IllegalArgumentException
>         at java.nio.Buffer.position(Buffer.java:236)
>         at
> org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
>         at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
>         at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
>         at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
>         at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
>         at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
>         at
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
>         at
> org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
>         at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>         at java.lang.Thread.run(Thread.java:744)
> > From: apurtell@apache.org
> > Date: Tue, 7 Oct 2014 17:09:35 -0700
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > > The cluster has 2 m1.large nodes.
> >
> > That's the problem right there.
> >
> > You need to look at c3.4xlarge or i2 instances as a minimum requirement.
> M1
> > and even M3 instance types have ridiculously poor IO.
> >
> >
> > On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> >
> > > Thanks Nicolas, Qiang.
> > >
> > > I was able to write a simple program that reproduces the problem on a
> tiny
> > > HBase cluster on ec2. The cluster has 2 m1.large nodes. One node runs
> the
> > > master, name node and zookeeper. The other node runs a data node and a
> > > region server, with heap size configured to be 6GB. There, the 1000
> > > parallel reverse gets (reverse scans) take 7-8 seconds. The data set is
> > > tiny (10M records, each having a small number of bytes). As I said
> before,
> > > all hardware resources are very idle there.
> > >
> > > Interestingly, running the same workload on my macbook, the 1000
> parallel
> > > gets take ~200ms on a pseudo-distributed installation.
> > >
> > > Any help to resolve this mystery is highly appreciated.
> > >
> > > P.S. please find my test program attached.
> > >
> > > Best,
> > > Khaled
> > >
> > > > From: nkeywal@gmail.com
> > > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > > Subject: Re: HBase read performance
> > > > To: user@hbase.apache.org
> > > >
> > > > Hi,
> > > >
> > > > I haven't seen it mentioned, but if I understand correctly each scan
> > > > returns a single row? If so you should use Scan#setSmall to save
> some rpc
> > > > calls.
> > > >
> > > > Cheers,
> > > >
> > > > Nicolas
> > > >
> > > >
> > > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com>
> wrote:
> > > >
> > > > > when using separate HConnection instance, both its
> > > > > RpcClient instance(maintain connection to a regionserver) and
> Registry
> > > > > instance(maintain connection to zookeeper) will be separate..
> > > > >
> > > > > see
> > > > >
> > > > >
> > >
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <
> kdiaa@hotmail.com>
> > > > > wrote:
> > > > >
> > > > > > I tried creating my own HConnections pool to use for my HBase
> calls,
> > > so
> > > > > > that not all the (2K) threads share the same HConnection.
> However, I
> > > > > could
> > > > > > only have 10 HConnections. Beyond that I get ZK exceptions,
> please
> > > find
> > > > > it
> > > > > > below. Also, with 10 HConnections, I don't see noticeable
> > > improvement in
> > > > > > performance so far.
> > > > > > 2014-10-05 06:11:26,490 WARN [main]
> zookeeper.RecoverableZooKeeper
> > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> transient
> > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > >
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > 06:11:26,490
> > > > > > INFO [main] util.RetryCounter
> > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN
> > > [main]
> > > > > > zookeeper.RecoverableZooKeeper
> > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> transient
> > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > >
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > 06:11:27,849
> > > > > > INFO [main] util.RetryCounter
> > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN
> > > [main]
> > > > > > zookeeper.RecoverableZooKeeper
> > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> transient
> > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > >
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > 06:11:30,405
> > > > > > INFO [main] util.RetryCounter
> > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN
> > > [main]
> > > > > > zookeeper.RecoverableZooKeeper
> > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> transient
> > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > >
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > 06:11:35,279
> > > > > > INFO [main] util.RetryCounter
> > > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN
> > > [main]
> > > > > > zookeeper.RecoverableZooKeeper
> > > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly
> transient
> > > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > > >
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > > 06:11:44,393
> > > > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists
> > > failed
> > > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main]
> zookeeper.ZKUtil
> > > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> > > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on
> znode
> > > > > >
> > > > >
> > >
> (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > > at
> > > > > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > at
> > > > > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > > at
> > > org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > Method) at
> > > > > >
> > > > >
> > >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > > at
> > > > > >
> > > > >
> > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > > at
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > > > > From: kdiaa@hotmail.com
> > > > > > > To: user@hbase.apache.org
> > > > > > > Subject: RE: HBase read performance
> > > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > > >
> > > > > > > Lars, Ted, and Qiang,
> > > > > > > Thanks for all the input.
> > > > > > > Qiang: yes all the threads are in the same client process
> sharing
> > > the
> > > > > > same connection. And since I don't see hardware contention, may
> be
> > > there
> > > > > is
> > > > > > contention over this code path. I'll try using many connections
> and
> > > see
> > > > > if
> > > > > > it alleviates the problems and I'll report back.
> > > > > > > Thanks again,Khaled
> > > > > > >
> > > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > > Subject: Re: HBase read performance
> > > > > > > > From: tianq01@gmail.com
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > Regarding to profiling, Andrew introduced
> > > > > > > >
> > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > > months
> > > > > > > > ago.
> > > > > > > >
> > > > > > > > processCallTime comes from RpcServer#call, so it looks good?
> > > > > > > >
> > > > > > > > I have a suspect:
> > > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > > >
> > > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > > if olny 1 process, those threads will share just 1
> connection to
> > > that
> > > > > > > > regionserver, there might be big contention on the RPC code
> path.
> > > > > > ---for
> > > > > > > > such case, could you try using different connections?
> > > > > > > >
> > > > > >
> > > > >
> > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com>
> > > wrote:
> > > > > > > >
> > > > > > > > > Khaled:
> > > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > > Profiling would give us more hint.
> > > > > > > > >
> > > > > > > > > Otherwise capturing stack trace during the period of
> reverse
> > > scan
> > > > > > would
> > > > > > > > > help.
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > > larsh@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > You might have the data in the OS buffer cache, without
> short
> > > > > > circuit
> > > > > > > > > > reading the region server has to request the block from
> the
> > > data
> > > > > > node
> > > > > > > > > > process, which then reads it from the block cache.
> > > > > > > > > > That is a few context switches per RPC that do not show
> up
> > > in CPU
> > > > > > > > > metrics.
> > > > > > > > > > In that you also would not see disk IO.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > If - as you say - you see a lot of evicted blocks the
> data
> > > *has*
> > > > > > to come
> > > > > > > > > > from the OS. If you do not see disk IO is *has* to come
> from
> > > the
> > > > > OS
> > > > > > > > > cache.
> > > > > > > > > > I.e. there's more RAM on your boxes, and you should
> increase
> > > the
> > > > > > heap
> > > > > > > > > block
> > > > > > > > > > cache.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > You can measure the context switches with vmstat. Other
> than
> > > that
> > > > > > I have
> > > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > > Also check the data locality index of the region server
> it
> > > should
> > > > > > be
> > > > > > > > > close
> > > > > > > > > > to 100%.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -- Lars
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ________________________________
> > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Lars thanks a lot about all the tips. I'll make sure I
> cover
> > > all
> > > > > > of them
> > > > > > > > > > and get back to you. I am not sure they are the
> bottleneck
> > > though
> > > > > > as they
> > > > > > > > > > all are about optimizing physical resource usage. As I
> said,
> > > I
> > > > > > don't see
> > > > > > > > > > any contended physical resources now. I'll also try to
> > > reproduce
> > > > > > this
> > > > > > > > > > problem in a simpler environment and pass to you the test
> > > program
> > > > > > to play
> > > > > > > > > > with.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Couple of high level points to make. You are right that
> my
> > > use
> > > > > > case is
> > > > > > > > > > kind of a worst case for HBase reads. But, if things go
> the
> > > way
> > > > > you
> > > > > > > > > > described them, there should be tons of disk IO and that
> > > should
> > > > > be
> > > > > > > > > clearly
> > > > > > > > > > the bottleneck. This is not the case though. That's for
> the
> > > > > simple
> > > > > > reason
> > > > > > > > > > that this is done in a test environment (I am still
> > > prototyping),
> > > > > > and
> > > > > > > > > not a
> > > > > > > > > > lot of data is yet written to HBase. However for the
> real use
> > > > > > case, there
> > > > > > > > > > should writers constantly writing data to HBase and
> readers
> > > > > > occasionally
> > > > > > > > > > doing this scatter/gather. At steady state, things should
> > > only
> > > > > get
> > > > > > worse
> > > > > > > > > > and all the issues you mentioned should get far more
> > > pronounced.
> > > > > > At this
> > > > > > > > > > point, one can try to mitigate it using more memory or
> so. I
> > > am
> > > > > > not there
> > > > > > > > > > yet as I think I am hitting some software bottleneck,
> which I
> > > > > > don't know
> > > > > > > > > > how to work around.
> > > > > > > > > >
> > > > > > > > > > Khaled
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----------------------------------------
> > > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > > >
> > > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > > Any chance that you can provide a minimal test program
> and
> > > > > > instruction
> > > > > > > > > > about how to set it up.
> > > > > > > > > > > We can do some profiling then.
> > > > > > > > > > >
> > > > > > > > > > > One thing to note is that with scanning HBase cannot
> use
> > > bloom
> > > > > > filters
> > > > > > > > > > to rule out HFiles ahead of time, it needs to look into
> all
> > > of
> > > > > > them.
> > > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > > > > - cannot use bloom filters
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Few more question/comments:
> > > > > > > > > > > - Do you have short circuit reading enabled? If not,
> you
> > > > > should.
> > > > > > > > > > > - Is your table major compacted? That will reduce the
> > > number of
> > > > > > files
> > > > > > > > > to
> > > > > > > > > > look at.
> > > > > > > > > > > - Did you disable Nagle's everywhere (enabled
> tcpnodelay)?
> > > It
> > > > > > disabled
> > > > > > > > > > by default in HBase, but necessarily in your install of
> HDFS.
> > > > > > > > > > > - Which version of HDFS are you using as backing
> > > filesystem?
> > > > > > > > > > > - If your disk is idle, it means the data fits into
> the OS
> > > > > buffer
> > > > > > > > > cache.
> > > > > > > > > > In turn that means that you increase the heap for the
> region
> > > > > > servers. You
> > > > > > > > > > can also use block encoding (FAST_DIFF) to try to make
> sure
> > > the
> > > > > > entire
> > > > > > > > > > working set fits into the cache.
> > > > > > > > > > >
> > > > > > > > > > > - Also try to reduce the block size - although if your
> > > overall
> > > > > > working
> > > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > This is a good section of the book to read through
> > > generally
> > > > > > (even
> > > > > > > > > > though you might know most of this already):
> > > > > > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > -- Lars
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > > Cc:
> > > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > > >
> > > > > > > > > > > I do see a very brief spike in CPU (user/system), but
> it's
> > > no
> > > > > > where
> > > > > > > > > near
> > > > > > > > > > 0% idle. It goes from 99% idle down to something like 40%
> > > idle
> > > > > for
> > > > > > a
> > > > > > > > > second
> > > > > > > > > > or so. The thing to note, this is all on a test cluster,
> so
> > > no
> > > > > > real load.
> > > > > > > > > > Things are generally idle until i issue 2-3 of these
> > > > > > multi-scan-requests
> > > > > > > > > to
> > > > > > > > > > render a web page. Then, you see the spike in the cpu and
> > > some
> > > > > > activity
> > > > > > > > > in
> > > > > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > If there are specific tests you'd like me to do to
> debug
> > > this,
> > > > > > I'd be
> > > > > > > > > > more than happy to do it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Khaled
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----------------------------------------
> > > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > > >>
> > > > > > > > > > >> I still think you're waiting on disk. No IOWAIT? So
> CPU
> > > is not
> > > > > > waiting
> > > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > > >>
> > > > > > > > > > >> If you see a lot of evicted block then each RPC has a
> high
> > > > > > chance of
> > > > > > > > > > requiring to bring an entire 64k block in. You'll see bad
> > > > > > performance
> > > > > > > > > with
> > > > > > > > > > this.
> > > > > > > > > > >>
> > > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > > >>
> > > > > > > > > > >> -- Lars
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> ________________________________
> > > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> I've set the heap size to 6GB and I do have gc
> logging. No
> > > > > long
> > > > > > pauses
> > > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > > >>
> > > > > > > > > > >> Other than the discrepancy between what's reported on
> the
> > > > > > client and
> > > > > > > > > > what's reported at the RS, there is also the issue of not
> > > getting
> > > > > > proper
> > > > > > > > > > concurrency. So, even if a reverse get takes 100ms or so
> > > (this
> > > > > has
> > > > > > to be
> > > > > > > > > > mostly blocking on various things as no physical
> resource is
> > > > > > contended),
> > > > > > > > > > then the other gets/scans should be able to proceed in
> > > parallel,
> > > > > > so a
> > > > > > > > > > thousand concurrent gets/scans should finish in few
> hundreds
> > > of
> > > > > ms
> > > > > > not
> > > > > > > > > many
> > > > > > > > > > seconds. That's why I thought I'd increase the handlers
> > > count to
> > > > > > try to
> > > > > > > > > get
> > > > > > > > > > more concurrency, but it didn't help. So, there must be
> > > something
> > > > > > else.
> > > > > > > > > > >>
> > > > > > > > > > >> Khaled
> > > > > > > > > > >>
> > > > > > > > > > >> ----------------------------------------
> > > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > > >>>
> > > > > > > > > > >>> Do check again on the heap size of the region
> servers.
> > > The
> > > > > > default
> > > > > > > > > > >>> unconfigured size is 1G; too small for much of
> anything.
> > > > > Check
> > > > > > your
> > > > > > > > > RS
> > > > > > > > > > logs
> > > > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor
> thread.
> > > > > They
> > > > > > > > > usually
> > > > > > > > > > >>> correlate with long GC pauses or other process-freeze
> > > events.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Get is implemented as a Scan of a single row, so a
> > > reverse
> > > > > > scan of a
> > > > > > > > > > single
> > > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > > >>>
> > > > > > > > > > >>> In practice, I have seen discrepancy between the
> > > latencies
> > > > > > reported
> > > > > > > > > by
> > > > > > > > > > the
> > > > > > > > > > >>> RS and the latencies experienced by the client. I've
> not
> > > > > > investigated
> > > > > > > > > > this
> > > > > > > > > > >>> area thoroughly.
> > > > > > > > > > >>>
> > > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> > > > > > kdiaa@hotmail.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Yes performance is similar with less handlers (I
> tried
> > > with
> > > > > > 100
> > > > > > > > > > first).
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> The payload is not big ~1KB or so. The working set
> > > doesn't
> > > > > > seem to
> > > > > > > > > > fit in
> > > > > > > > > > >>>> memory as there are many cache misses. However,
> disk is
> > > far
> > > > > > from
> > > > > > > > > > being a
> > > > > > > > > > >>>> bottleneck. I checked using iostat. I also verified
> that
> > > > > > neither the
> > > > > > > > > > >>>> network nor the CPU of the region server or the
> client
> > > are a
> > > > > > > > > > bottleneck.
> > > > > > > > > > >>>> This leads me to believe that likely this is a
> software
> > > > > > bottleneck,
> > > > > > > > > > >>>> possibly due to a misconfiguration on my side. I
> just
> > > don't
> > > > > > know how
> > > > > > > > > > to
> > > > > > > > > > >>>> debug it. A clear disconnect I see is the individual
> > > request
> > > > > > latency
> > > > > > > > > > as
> > > > > > > > > > >>>> reported by metrics on the region server (IPC
> > > > > processCallTime
> > > > > > vs
> > > > > > > > > > scanNext)
> > > > > > > > > > >>>> vs what's measured on the client. Does this sound
> > > right? Any
> > > > > > ideas
> > > > > > > > > on
> > > > > > > > > > how
> > > > > > > > > > >>>> to better debug it?
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> About this trick with the timestamps to be able to
> do a
> > > > > > forward
> > > > > > > > > scan,
> > > > > > > > > > >>>> thanks for pointing it out. Actually, I am aware of
> it.
> > > The
> > > > > > problem
> > > > > > > > > I
> > > > > > > > > > have
> > > > > > > > > > >>>> is, sometimes I want to get the key after a
> particular
> > > > > > timestamp and
> > > > > > > > > > >>>> sometimes I want to get the key before, so just
> relying
> > > on
> > > > > > the key
> > > > > > > > > > order
> > > > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I
> thought
> > > > > > reverse
> > > > > > > > > scan
> > > > > > > > > > can
> > > > > > > > > > >>>> do the trick though.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Khaled
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > > >>>>> is it the same with fewer threads? 1500 handler
> threads
> > > > > > seems to
> > > > > > > > > be a
> > > > > > > > > > >>>> lot. Typically a good number of threads depends on
> the
> > > > > > hardware
> > > > > > > > > > (number of
> > > > > > > > > > >>>> cores, number of spindles, etc). I cannot think of
> any
> > > type
> > > > > of
> > > > > > > > > > scenario
> > > > > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> How large is the payload per KV retrieved that
> way? If
> > > > > large
> > > > > > (as
> > > > > > > > > in a
> > > > > > > > > > >>>> few 100k) you definitely want to lower the number
> of the
> > > > > > handler
> > > > > > > > > > threads.
> > > > > > > > > > >>>>> How much heap do you give the region server? Does
> the
> > > > > > working set
> > > > > > > > > fit
> > > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you see the
> > > > > eviction
> > > > > > count
> > > > > > > > > > going
> > > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> If the working set does not fit into the cache
> > > (eviction
> > > > > > count goes
> > > > > > > > > > up)
> > > > > > > > > > >>>> then HBase will need to bring a new block in from
> disk
> > > on
> > > > > > each Get
> > > > > > > > > > >>>> (assuming the Gets are more or less random as far
> as the
> > > > > > server is
> > > > > > > > > > >>>> concerned).
> > > > > > > > > > >>>>> In case you'll benefit from reducing the HFile
> block
> > > size
> > > > > > (from 64k
> > > > > > > > > > to
> > > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Lastly I don't think we tested the performance of
> using
> > > > > > reverse
> > > > > > > > > scan
> > > > > > > > > > >>>> this way, there is probably room to optimize this.
> > > > > > > > > > >>>>> Can you restructure your keys to allow forwards
> > > scanning?
> > > > > For
> > > > > > > > > example
> > > > > > > > > > >>>> you could store the time as MAX_LONG-time. Or you
> could
> > > > > > invert all
> > > > > > > > > > the bits
> > > > > > > > > > >>>> of the time portion of the key, so that it sort the
> > > other
> > > > > > way. Then
> > > > > > > > > > you
> > > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> -- Lars
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > >>>>> To: "user@hbase.apache.org" <user@hbase.apache.org
> >
> > > > > > > > > > >>>>> Cc:
> > > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Hi,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase
> (0.98.6.1),
> > > > > > where I
> > > > > > > > > have
> > > > > > > > > > a
> > > > > > > > > > >>>> client reading ~1000 keys from an HBase table. These
> > > keys
> > > > > > happen to
> > > > > > > > > > fall on
> > > > > > > > > > >>>> the same region server. For my reads I use reverse
> scan
> > > to
> > > > > > read each
> > > > > > > > > > key as
> > > > > > > > > > >>>> I want the key prior to a specific time stamp (time
> > > stamps
> > > > > are
> > > > > > > > > stored
> > > > > > > > > > in
> > > > > > > > > > >>>> reverse order). I don't believe gets can accomplish
> > > that,
> > > > > > right? so
> > > > > > > > > I
> > > > > > > > > > use
> > > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> I use 2000 reader threads in the client and on
> HBase,
> > > I've
> > > > > > set
> > > > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With this
> > > setup,
> > > > > my
> > > > > > > > > scatter
> > > > > > > > > > >>>> gather is very slow and can take up to 10s in total.
> > > Timing
> > > > > an
> > > > > > > > > > individual
> > > > > > > > > > >>>> getScanner(..) call on the client side, it can
> easily
> > > take
> > > > > few
> > > > > > > > > > hundreds of
> > > > > > > > > > >>>> ms. I also got the following metrics from the region
> > > server
> > > > > in
> > > > > > > > > > question:
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > > >>>>> "queueCallTime_99th_percentile" :
> 556.9799999999818,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > > > >>>> : 89,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > > > >>>> : 11300,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > > > >>>> : 101.0,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > > > >>>> : 113.0,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Where "delta" is the name of the table I am
> querying.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> In addition to all this, i monitored the hardware
> > > resources
> > > > > > (CPU,
> > > > > > > > > > disk,
> > > > > > > > > > >>>> and network) of both the client and the region
> server
> > > and
> > > > > > nothing
> > > > > > > > > > seems
> > > > > > > > > > >>>> anywhere near saturation. So I am puzzled by what's
> > > going on
> > > > > > and
> > > > > > > > > > where this
> > > > > > > > > > >>>> time is going.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Few things to note based on the above measurements:
> > > both
> > > > > > medians of
> > > > > > > > > > IPC
> > > > > > > > > > >>>> processCallTime and queueCallTime are basically zero
> > > (ms I
> > > > > > presume,
> > > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms too,
> > > right?). I
> > > > > > am not
> > > > > > > > > > sure
> > > > > > > > > > >>>> how this adds up. Also, even though the 101 figure
> seems
> > > > > > > > > outrageously
> > > > > > > > > > high
> > > > > > > > > > >>>> and I don't know why, still all these scans should
> be
> > > > > > happening in
> > > > > > > > > > >>>> parallel, so the overall call should finish fast,
> given
> > > that
> > > > > > no
> > > > > > > > > > hardware
> > > > > > > > > > >>>> resource is contended, right? but this is not what's
> > > > > > happening, so I
> > > > > > > > > > have
> > > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Thanks,
> > > > > > > > > > >>>>> Khaled
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
>
>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Andrew, thanks. I think this indeed was the problem.
To get over it, I increased the amount of memory given to the region server to avoid IO on reads. I used the configs below. In my experiments, I have writers streaming their output to HBase. The reader powers a web page and does this scatter/gather, where it reads the 1000 most recently written keys and passes them to the front end. With this workload, I get the exception below at the region server. Again, I am using HBase (0.98.6.1). Any help is appreciated.
hbase-env: HBASE_REGIONSERVER_OPTS=-Xmx6G -XX:MaxDirectMemorySize=5G -XX:CMSInitiatingOccupancyFraction=88 -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-regionserver-gc.log
hbase-site: hbase.bucketcache.ioengine=offheap
            hfile.block.cache.size=0.4
            hbase.bucketcache.size=4196
            hbase.rs.cacheblocksonwrite=true
            hfile.block.index.cacheonwrite=true
            hfile.block.bloom.cacheonwrite=true
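
For reference, a minimal sketch of the same settings expressed programmatically (the class and method names are made up; the values are simply the ones listed above, and in a real deployment they belong in hbase-site.xml on each region server):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CacheConfigSketch {
        public static Configuration buildConf() {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.bucketcache.ioengine", "offheap");       // off-heap bucket cache
            conf.setFloat("hfile.block.cache.size", 0.4f);           // on-heap block cache share of heap
            conf.setInt("hbase.bucketcache.size", 4196);             // bucket cache size in MB
            conf.setBoolean("hbase.rs.cacheblocksonwrite", true);
            conf.setBoolean("hfile.block.index.cacheonwrite", true);
            conf.setBoolean("hfile.block.bloom.cacheonwrite", true);
            return conf;
        }
    }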

2014-10-10 15:06:44,173 ERROR [B.DefaultRpcServer.handler=62,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object 
java.lang.IllegalArgumentException
	at java.nio.Buffer.position(Buffer.java:236)
	at org.apache.hadoop.hbase.util.ByteBufferUtils.skip(ByteBufferUtils.java:434)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.readKeyValueLen(HFileReaderV2.java:849)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:760)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:248)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
	at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1780)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3758)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3157)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29587)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
	at java.lang.Thread.run(Thread.java:744)
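
For context, a minimal sketch of the per-key reverse read this workload issues, written against the 0.98 client API and using Scan#setSmall as Nicolas suggests further down. The table name "delta" comes from earlier in the thread; the key format, class name and helper method are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReverseGetSketch {
        // Returns the row at or immediately before the given start key, or null.
        static Result reverseGet(HConnection conn, byte[] startKey) throws Exception {
            HTableInterface table = conn.getTable("delta");
            try {
                Scan scan = new Scan(startKey);
                scan.setReversed(true);   // walk backwards from startKey
                scan.setSmall(true);      // single-row read: skips the extra scanner RPCs
                scan.setCaching(1);
                ResultScanner scanner = table.getScanner(scan);
                try {
                    return scanner.next();
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HConnection conn = HConnectionManager.createConnection(conf);
            // Hypothetical key: "<entity>_<reversed timestamp>" as described in the thread.
            Result r = reverseGet(conn, Bytes.toBytes("entity42_9223372036854775807"));
            System.out.println(r == null ? "no row found" : Bytes.toString(r.getRow()));
            conn.close();
        }
    }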
> From: apurtell@apache.org
> Date: Tue, 7 Oct 2014 17:09:35 -0700
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
> 
> > The cluster has 2 m1.large nodes.
> 
> That's the problem right there.
> 
> You need to look at c3.4xlarge or i2 instances as a minimum requirement. M1
> and even M3 instance types have ridiculously poor IO.
> 
> 
> On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
> 
> > Thanks Nicolas, Qiang.
> >
> > I was able to write a simple program that reproduces the problem on a tiny
> > HBase cluster on ec2. The cluster has 2 m1.large nodes. One node runs the
> > master, name node and zookeeper. The other node runs a data node and a
> > region server, with heap size configured to be 6GB. There, the 1000
> > parallel reverse gets (reverse scans) take 7-8 seconds. The data set is
> > tiny (10M records, each having a small number of bytes). As I said before,
> > all hardware resources are very idle there.
> >
> > Interestingly, running the same workload on my macbook, the 1000 parallel
> > gets take ~200ms on a pseudo-distributed installation.
> >
> > Any help to resolve this mystery is highly appreciated.
> >
> > P.S. please find my test program attached.
> >
> > Best,
> > Khaled
> >
> > > From: nkeywal@gmail.com
> > > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > > Subject: Re: HBase read performance
> > > To: user@hbase.apache.org
> > >
> > > Hi,
> > >
> > > I haven't seen it mentioned, but if I understand correctly each scan
> > > returns a single row? If so you should use Scan#setSmall to save some rpc
> > > calls.
> > >
> > > Cheers,
> > >
> > > Nicolas
> > >
> > >
> > > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com> wrote:
> > >
> > > > when using separate HConnection instance, both its
> > > > RpcClient instance(maintain connection to a regionserver) and Registry
> > > > instance(maintain connection to zookeeper) will be separate..
> > > >
> > > > see
> > > >
> > > >
> > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com>
> > > > wrote:
> > > >
> > > > > I tried creating my own HConnections pool to use for my HBase calls,
> > so
> > > > > that not all the (2K) threads share the same HConnection. However, I
> > > > could
> > > > > only have 10 HConnections. Beyond that I get ZK exceptions, please
> > find
> > > > it
> > > > > below. Also, with 10 HConnections, I don't see noticeable
> > improvement in
> > > > > performance so far.
> > > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper
> > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > >
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > 06:11:26,490
> > > > > INFO [main] util.RetryCounter
> > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN
> > [main]
> > > > > zookeeper.RecoverableZooKeeper
> > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > >
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > 06:11:27,849
> > > > > INFO [main] util.RetryCounter
> > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN
> > [main]
> > > > > zookeeper.RecoverableZooKeeper
> > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > >
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > 06:11:30,405
> > > > > INFO [main] util.RetryCounter
> > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN
> > [main]
> > > > > zookeeper.RecoverableZooKeeper
> > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > >
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > 06:11:35,279
> > > > > INFO [main] util.RetryCounter
> > > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN
> > [main]
> > > > > zookeeper.RecoverableZooKeeper
> > > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > > >
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > > 06:11:44,393
> > > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists
> > failed
> > > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil
> > > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> > > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode
> > > > >
> > > >
> > (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > at
> > > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > at
> > > > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > > >
> > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > > at
> > org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > > at
> > > > >
> > > >
> > org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > > at
> > > > >
> > > >
> > org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > > at
> > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > > at
> > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > Method) at
> > > > >
> > > >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > at
> > > > >
> > > >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > at
> > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > > at
> > > > >
> > > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > > > From: kdiaa@hotmail.com
> > > > > > To: user@hbase.apache.org
> > > > > > Subject: RE: HBase read performance
> > > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > > >
> > > > > > Lars, Ted, and Qiang,
> > > > > > Thanks for all the input.
> > > > > > Qiang: yes all the threads are in the same client process sharing
> > the
> > > > > same connection. And since I don't see hardware contention, may be
> > there
> > > > is
> > > > > contention over this code path. I'll try using many connections and
> > see
> > > > if
> > > > > it alleviates the problems and I'll report back.
> > > > > > Thanks again,Khaled
> > > > > >
> > > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > > Subject: Re: HBase read performance
> > > > > > > From: tianq01@gmail.com
> > > > > > > To: user@hbase.apache.org
> > > > > > >
> > > > > > > Regarding profiling, Andrew introduced
> > > > > > >
> > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > > months
> > > > > > > ago.
> > > > > > >
> > > > > > > processCallTime comes from RpcServer#call, so it looks good?
> > > > > > >
> > > > > > > I have a suspect:
> > https://issues.apache.org/jira/browse/HBASE-11306
> > > > > > >
> > > > > > > how many processes do you have for your 2000 threads?
> > > > > > > if only 1 process, those threads will share just 1 connection to
> > that
> > > > > > > regionserver, there might be big contention on the RPC code path.
> > > > > ---for
> > > > > > > such case, could you try using different connections?
> > > > > > >
> > > > >
> > > >
> > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > > Khaled:
> > > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > > Profiling would give us more hint.
> > > > > > > >
> > > > > > > > Otherwise capturing stack trace during the period of reverse
> > scan
> > > > > would
> > > > > > > > help.
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> > larsh@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > You might have the data in the OS buffer cache, without short
> > > > > circuit
> > > > > > > > > reading the region server has to request the block from the
> > data
> > > > > node
> > > > > > > > > process, which then reads it from the block cache.
> > > > > > > > > That is a few context switches per RPC that do not show up
> > in CPU
> > > > > > > > metrics.
> > > > > > > > > In that case you also would not see disk IO.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If - as you say - you see a lot of evicted blocks the data
> > *has*
> > > > > to come
> > > > > > > > > from the OS. If you do not see disk IO it *has* to come from
> > the
> > > > OS
> > > > > > > > cache.
> > > > > > > > > I.e. there's more RAM on your boxes, and you should increase
> > the
> > > > > heap
> > > > > > > > block
> > > > > > > > > cache.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > You can measure the context switches with vmstat. Other than
> > that
> > > > > I have
> > > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > > Also check the data locality index of the region server it
> > should
> > > > > be
> > > > > > > > close
> > > > > > > > > to 100%.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -- Lars
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ________________________________
> > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Lars thanks a lot about all the tips. I'll make sure I cover
> > all
> > > > > of them
> > > > > > > > > and get back to you. I am not sure they are the bottleneck
> > though
> > > > > as they
> > > > > > > > > all are about optimizing physical resource usage. As I said,
> > I
> > > > > don't see
> > > > > > > > > any contended physical resources now. I'll also try to
> > reproduce
> > > > > this
> > > > > > > > > problem in a simpler environment and pass to you the test
> > program
> > > > > to play
> > > > > > > > > with.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Couple of high level points to make. You are right that my
> > use
> > > > > case is
> > > > > > > > > kind of a worst case for HBase reads. But, if things go the
> > way
> > > > you
> > > > > > > > > described them, there should be tons of disk IO and that
> > should
> > > > be
> > > > > > > > clearly
> > > > > > > > > the bottleneck. This is not the case though. That's for the
> > > > simple
> > > > > reason
> > > > > > > > > that this is done in a test environment (I am still
> > prototyping),
> > > > > and
> > > > > > > > not a
> > > > > > > > > lot of data is yet written to HBase. However for the real use
> > > > > case, there
> > > > > > > > > should writers constantly writing data to HBase and readers
> > > > > occasionally
> > > > > > > > > doing this scatter/gather. At steady state, things should
> > only
> > > > get
> > > > > worse
> > > > > > > > > and all the issues you mentioned should get far more
> > pronounced.
> > > > > At this
> > > > > > > > > point, one can try to mitigate it using more memory or so. I
> > am
> > > > > not there
> > > > > > > > > yet as I think I am hitting some software bottleneck, which I
> > > > > don't know
> > > > > > > > > how to work around.
> > > > > > > > >
> > > > > > > > > Khaled
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----------------------------------------
> > > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > > From: larsh@apache.org
> > > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > > To: user@hbase.apache.org
> > > > > > > > > >
> > > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > > Any chance that you can provide a minimal test program and
> > > > > instruction
> > > > > > > > > about how to set it up.
> > > > > > > > > > We can do some profiling then.
> > > > > > > > > >
> > > > > > > > > > One thing to note is that with scanning HBase cannot use
> > bloom
> > > > > filters
> > > > > > > > > to rule out HFiles ahead of time, it needs to look into all
> > of
> > > > > them.
> > > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > > > - cannot use bloom filters
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Few more question/comments:
> > > > > > > > > > - Do you have short circuit reading enabled? If not, you
> > > > should.
> > > > > > > > > > - Is your table major compacted? That will reduce the
> > number of
> > > > > files
> > > > > > > > to
> > > > > > > > > look at.
> > > > > > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It's
> > > > > > > > > > disabled by default in HBase, but not necessarily in your install of HDFS.
> > > > > > > > > > - Which version of HDFS are you using as backing
> > filesystem?
> > > > > > > > > > - If your disk is idle, it means the data fits into the OS
> > > > buffer
> > > > > > > > cache.
> > > > > > > > > In turn that means that you increase the heap for the region
> > > > > servers. You
> > > > > > > > > can also use block encoding (FAST_DIFF) to try to make sure
> > the
> > > > > entire
> > > > > > > > > working set fits into the cache.
> > > > > > > > > >
> > > > > > > > > > - Also try to reduce the block size - although if your
> > overall
> > > > > working
> > > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This is a good section of the book to read through
> > generally
> > > > > (even
> > > > > > > > > though you might know most of this already):
> > > > > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -- Lars
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > > Cc:
> > > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > > >
> > > > > > > > > > I do see a very brief spike in CPU (user/system), but it's
> > no
> > > > > where
> > > > > > > > near
> > > > > > > > > 0% idle. It goes from 99% idle down to something like 40%
> > idle
> > > > for
> > > > > a
> > > > > > > > second
> > > > > > > > > or so. The thing to note, this is all on a test cluster, so
> > no
> > > > > real load.
> > > > > > > > > Things are generally idle until i issue 2-3 of these
> > > > > multi-scan-requests
> > > > > > > > to
> > > > > > > > > render a web page. Then, you see the spike in the cpu and
> > some
> > > > > activity
> > > > > > > > in
> > > > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > If there are specific tests you'd like me to do to debug
> > this,
> > > > > I'd be
> > > > > > > > > more than happy to do it.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Khaled
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----------------------------------------
> > > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > > >> From: larsh@apache.org
> > > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > > >>
> > > > > > > > > >> I still think you're waiting on disk. No IOWAIT? So CPU
> > is not
> > > > > waiting
> > > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > > >>
> > > > > > > > > >> If you see a lot of evicted block then each RPC has a high
> > > > > chance of
> > > > > > > > > requiring to bring an entire 64k block in. You'll see bad
> > > > > performance
> > > > > > > > with
> > > > > > > > > this.
> > > > > > > > > >>
> > > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > > >>
> > > > > > > > > >> -- Lars
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> ________________________________
> > > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> I've set the heap size to 6GB and I do have gc logging. No
> > > > long
> > > > > pauses
> > > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > > >>
> > > > > > > > > >> Other than the discrepancy between what's reported on the
> > > > > client and
> > > > > > > > > what's reported at the RS, there is also the issue of not
> > getting
> > > > > proper
> > > > > > > > > concurrency. So, even if a reverse get takes 100ms or so
> > (this
> > > > has
> > > > > to be
> > > > > > > > > mostly blocking on various things as no physical resource is
> > > > > contended),
> > > > > > > > > then the other gets/scans should be able to proceed in
> > parallel,
> > > > > so a
> > > > > > > > > thousand concurrent gets/scans should finish in few hundreds
> > of
> > > > ms
> > > > > not
> > > > > > > > many
> > > > > > > > > seconds. That's why I thought I'd increase the handlers
> > count to
> > > > > try to
> > > > > > > > get
> > > > > > > > > more concurrency, but it didn't help. So, there must be
> > something
> > > > > else.
> > > > > > > > > >>
> > > > > > > > > >> Khaled
> > > > > > > > > >>
> > > > > > > > > >> ----------------------------------------
> > > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > > >>>
> > > > > > > > > >>> Do check again on the heap size of the region servers.
> > The
> > > > > default
> > > > > > > > > >>> unconfigured size is 1G; too small for much of anything.
> > > > Check
> > > > > your
> > > > > > > > RS
> > > > > > > > > logs
> > > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor thread.
> > > > They
> > > > > > > > usually
> > > > > > > > > >>> correlate with long GC pauses or other process-freeze
> > events.
> > > > > > > > > >>>
> > > > > > > > > >>> Get is implemented as a Scan of a single row, so a
> > reverse
> > > > > scan of a
> > > > > > > > > single
> > > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > > >>>
> > > > > > > > > >>> In practice, I have seen discrepancy between the
> > latencies
> > > > > reported
> > > > > > > > by
> > > > > > > > > the
> > > > > > > > > >>> RS and the latencies experienced by the client. I've not
> > > > > investigated
> > > > > > > > > this
> > > > > > > > > >>> area thoroughly.
> > > > > > > > > >>>
> > > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> > > > > kdiaa@hotmail.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >>>
> > > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Yes performance is similar with less handlers (I tried
> > with
> > > > > 100
> > > > > > > > > first).
> > > > > > > > > >>>>
> > > > > > > > > >>>> The payload is not big ~1KB or so. The working set
> > doesn't
> > > > > seem to
> > > > > > > > > fit in
> > > > > > > > > >>>> memory as there are many cache misses. However, disk is
> > far
> > > > > from
> > > > > > > > > being a
> > > > > > > > > >>>> bottleneck. I checked using iostat. I also verified that
> > > > > neither the
> > > > > > > > > >>>> network nor the CPU of the region server or the client
> > are a
> > > > > > > > > bottleneck.
> > > > > > > > > >>>> This leads me to believe that likely this is a software
> > > > > bottleneck,
> > > > > > > > > >>>> possibly due to a misconfiguration on my side. I just
> > don't
> > > > > know how
> > > > > > > > > to
> > > > > > > > > >>>> debug it. A clear disconnect I see is the individual
> > request
> > > > > latency
> > > > > > > > > as
> > > > > > > > > >>>> reported by metrics on the region server (IPC
> > > > processCallTime
> > > > > vs
> > > > > > > > > scanNext)
> > > > > > > > > >>>> vs what's measured on the client. Does this sound
> > right? Any
> > > > > ideas
> > > > > > > > on
> > > > > > > > > how
> > > > > > > > > >>>> to better debug it?
> > > > > > > > > >>>>
> > > > > > > > > >>>> About this trick with the timestamps to be able to do a
> > > > > forward
> > > > > > > > scan,
> > > > > > > > > >>>> thanks for pointing it out. Actually, I am aware of it.
> > The
> > > > > problem
> > > > > > > > I
> > > > > > > > > have
> > > > > > > > > >>>> is, sometimes I want to get the key after a particular
> > > > > timestamp and
> > > > > > > > > >>>> sometimes I want to get the key before, so just relying
> > on
> > > > > the key
> > > > > > > > > order
> > > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought
> > > > > reverse
> > > > > > > > scan
> > > > > > > > > can
> > > > > > > > > >>>> do the trick though.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Khaled
> > > > > > > > > >>>>
> > > > > > > > > >>>> ----------------------------------------
> > > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Hi Khaled,
> > > > > > > > > >>>>> is it the same with fewer threads? 1500 handler threads
> > > > > seems to
> > > > > > > > be a
> > > > > > > > > >>>> lot. Typically a good number of threads depends on the
> > > > > hardware
> > > > > > > > > (number of
> > > > > > > > > >>>> cores, number of spindles, etc). I cannot think of any
> > type
> > > > of
> > > > > > > > > scenario
> > > > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> How large is the payload per KV retrieved that way? If
> > > > large
> > > > > (as
> > > > > > > > in a
> > > > > > > > > >>>> few 100k) you definitely want to lower the number of the
> > > > > handler
> > > > > > > > > threads.
> > > > > > > > > >>>>> How much heap do you give the region server? Does the
> > > > > working set
> > > > > > > > fit
> > > > > > > > > >>>> into the cache? (i.e. in the metrics, do you see the
> > > > eviction
> > > > > count
> > > > > > > > > going
> > > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> If the working set does not fit into the cache
> > (eviction
> > > > > count goes
> > > > > > > > > up)
> > > > > > > > > >>>> then HBase will need to bring a new block in from disk
> > on
> > > > > each Get
> > > > > > > > > >>>> (assuming the Gets are more or less random as far as the
> > > > > server is
> > > > > > > > > >>>> concerned).
> > > > > > > > > >>>>> In that case you'll benefit from reducing the HFile block
> > size
> > > > > (from 64k
> > > > > > > > > to
> > > > > > > > > >>>> 8k or even 4k).
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Lastly I don't think we tested the performance of using
> > > > > reverse
> > > > > > > > scan
> > > > > > > > > >>>> this way, there is probably room to optimize this.
> > > > > > > > > >>>>> Can you restructure your keys to allow forwards
> > scanning?
> > > > For
> > > > > > > > example
> > > > > > > > > >>>> you could store the time as MAX_LONG-time. Or you could
> > > > > invert all
> > > > > > > > > the bits
> > > > > > > > > >>>> of the time portion of the key, so that it sorts the
> > other
> > > > > way. Then
> > > > > > > > > you
> > > > > > > > > >>>> could do a forward scan.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> -- Lars
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > >>>>> Cc:
> > > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Hi,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1),
> > > > > where I
> > > > > > > > have
> > > > > > > > > a
> > > > > > > > > >>>> client reading ~1000 keys from an HBase table. These
> > keys
> > > > > happen to
> > > > > > > > > fall on
> > > > > > > > > >>>> the same region server. For my reads I use reverse scan
> > to
> > > > > read each
> > > > > > > > > key as
> > > > > > > > > >>>> I want the key prior to a specific time stamp (time
> > stamps
> > > > are
> > > > > > > > stored
> > > > > > > > > in
> > > > > > > > > >>>> reverse order). I don't believe gets can accomplish
> > that,
> > > > > right? so
> > > > > > > > I
> > > > > > > > > use
> > > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> I use 2000 reader threads in the client and on HBase,
> > I've
> > > > > set
> > > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With this
> > setup,
> > > > my
> > > > > > > > scatter
> > > > > > > > > >>>> gather is very slow and can take up to 10s in total.
> > Timing
> > > > an
> > > > > > > > > individual
> > > > > > > > > >>>> getScanner(..) call on the client side, it can easily
> > take
> > > > few
> > > > > > > > > hundreds of
> > > > > > > > > >>>> ms. I also got the following metrics from the region
> > server
> > > > in
> > > > > > > > > question:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > > >>>> : 89,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > > >>>> : 11300,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > > >>>> : 101.0,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > > >>>> : 101.0,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > > >>>> : 101.0,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > > >>>> : 113.0,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> In addition to all this, i monitored the hardware
> > resources
> > > > > (CPU,
> > > > > > > > > disk,
> > > > > > > > > >>>> and network) of both the client and the region server
> > and
> > > > > nothing
> > > > > > > > > seems
> > > > > > > > > >>>> anywhere near saturation. So I am puzzled by what's
> > going on
> > > > > and
> > > > > > > > > where this
> > > > > > > > > >>>> time is going.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Few things to note based on the above measurements:
> > both
> > > > > medians of
> > > > > > > > > IPC
> > > > > > > > > >>>> processCallTime and queueCallTime are basically zero
> > (ms I
> > > > > presume,
> > > > > > > > > >>>> right?). However, scanNext_median is 101 (ms too,
> > right?). I
> > > > > am not
> > > > > > > > > sure
> > > > > > > > > >>>> how this adds up. Also, even though the 101 figure seems
> > > > > > > > outrageously
> > > > > > > > > high
> > > > > > > > > >>>> and I don't know why, still all these scans should be
> > > > > happening in
> > > > > > > > > >>>> parallel, so the overall call should finish fast, given
> > that
> > > > > no
> > > > > > > > > hardware
> > > > > > > > > >>>> resource is contended, right? but this is not what's
> > > > > happening, so I
> > > > > > > > > have
> > > > > > > > > >>>> to be missing something(s).
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Thanks,
> > > > > > > > > >>>>> Khaled
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> >
> 
> 
> 
> -- 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: HBase read performance

Posted by Andrew Purtell <ap...@apache.org>.
> The cluster has 2 m1.large nodes.

That's the problem right there.

You need to look at c3.4xlarge or i2 instances as a minimum requirement. M1
and even M3 instance types have ridiculously poor IO.


On Tue, Oct 7, 2014 at 3:01 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:

> Thanks Nicolas, Qiang.
>
> I was able to write a simple program that reproduces the problem on a tiny
> HBase cluster on ec2. The cluster has 2 m1.large nodes. One node runs the
> master, name node and zookeeper. The other node runs a data node and a
> region server, with heap size configured to be 6GB. There, the 1000
> parallel reverse gets (reverse scans) take 7-8 seconds. The data set is
> tiny (10M records, each having a small number of bytes). As I said before,
> all hardware resources are very idle there.
>
> Interestingly, running the same workload on my macbook, the 1000 parallel
> gets take ~200ms on a pseudo-distributed installation.
>
> Any help to resolve this mystery is highly appreciated.
>
> P.S. please find my test program attached.
>
> Best,
> Khaled
>
> > From: nkeywal@gmail.com
> > Date: Mon, 6 Oct 2014 09:40:48 +0200
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > Hi,
> >
> > I haven't seen it mentioned, but if I understand correctly each scan
> > returns a single row? If so you should use Scan#setSmall to save some rpc
> > calls.
> >
> > Cheers,
> >
> > Nicolas
> >
> >
> > On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com> wrote:
> >
> > > when using separate HConnection instance, both its
> > > RpcClient instance(maintain connection to a regionserver) and Registry
> > > instance(maintain connection to zookeeper) will be separate..
> > >
> > > see
> > >
> > >
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> > >
> > >
> > >
> > >
> > >
> > > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com>
> > > wrote:
> > >
> > > > I tried creating my own HConnections pool to use for my HBase calls,
> so
> > > > that not all the (2K) threads share the same HConnection. However, I
> > > could
> > > > only have 10 HConnections. Beyond that I get ZK exceptions, please
> find
> > > it
> > > > below. Also, with 10 HConnections, I don't see noticeable
> improvement in
> > > > performance so far.
> > > > 2014-10-05 06:11:26,490 WARN [main] zookeeper.RecoverableZooKeeper
> > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > 06:11:26,490
> > > > INFO [main] util.RetryCounter
> > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN
> [main]
> > > > zookeeper.RecoverableZooKeeper
> > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > 06:11:27,849
> > > > INFO [main] util.RetryCounter
> > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN
> [main]
> > > > zookeeper.RecoverableZooKeeper
> > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > 06:11:30,405
> > > > INFO [main] util.RetryCounter
> > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN
> [main]
> > > > zookeeper.RecoverableZooKeeper
> > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > 06:11:35,279
> > > > INFO [main] util.RetryCounter
> > > (RetryCounter.java:sleepUntilNextRetry(155))
> > > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN
> [main]
> > > > zookeeper.RecoverableZooKeeper
> > > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > > ZooKeeper, quorum=54.68.206.252:2181,
> > > >
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > > 06:11:44,393
> > > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists
> failed
> > > > after 4 attempts2014-10-05 06:11:44,394 WARN [main] zookeeper.ZKUtil
> > > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> > > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode
> > > >
> > >
> (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at
> > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at
> > > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at
> > > >
> > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > > at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > > > at
> > > >
> > >
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > > > at
> > > >
> > >
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > > > at
> > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > > > at
> > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > Method) at
> > > >
> > >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > at
> > > >
> > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > at
> > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > > > at
> > > >
> > >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > > From: kdiaa@hotmail.com
> > > > > To: user@hbase.apache.org
> > > > > Subject: RE: HBase read performance
> > > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > > >
> > > > > Lars, Ted, and Qiang,
> > > > > Thanks for all the input.
> > > > > Qiang: yes all the threads are in the same client process sharing
> the
> > > > same connection. And since I don't see hardware contention, may be
> there
> > > is
> > > > contention over this code path. I'll try using many connections and
> see
> > > if
> > > > it alleviates the problems and I'll report back.
> > > > > Thanks again,Khaled
> > > > >
> > > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > > Subject: Re: HBase read performance
> > > > > > From: tianq01@gmail.com
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > Regarding profiling, Andrew introduced
> > > > > >
> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > > months
> > > > > > ago.
> > > > > >
> > > > > > processCallTime comes from RpcServer#call, so it looks good?
> > > > > >
> > > > > > I have a suspect:
> https://issues.apache.org/jira/browse/HBASE-11306
> > > > > >
> > > > > > how many processes do you have for your 2000 threads?
> > > > > > if only 1 process, those threads will share just 1 connection to
> that
> > > > > > regionserver, there might be big contention on the RPC code path.
> > > > ---for
> > > > > > such case, could you try using different connections?
> > > > > >
> > > >
> > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > > >
> > > > > > > Khaled:
> > > > > > > Do you have profiler such as jprofiler ?
> > > > > > > Profiling would give us more hint.
> > > > > > >
> > > > > > > Otherwise capturing stack trace during the period of reverse
> scan
> > > > would
> > > > > > > help.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <
> larsh@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > > You might have the data in the OS buffer cache, without short
> > > > circuit
> > > > > > > > reading the region server has to request the block from the
> data
> > > > node
> > > > > > > > process, which then reads it from the block cache.
> > > > > > > > That is a few context switches per RPC that do not show up
> in CPU
> > > > > > > metrics.
> > > > > > > > In that case you also would not see disk IO.
> > > > > > > >
> > > > > > > >
> > > > > > > > If - as you say - you see a lot of evicted blocks the data
> *has*
> > > > to come
> > > > > > > > from the OS. If you do not see disk IO it *has* to come from
> the
> > > OS
> > > > > > > cache.
> > > > > > > > I.e. there's more RAM on your boxes, and you should increase
> the
> > > > heap
> > > > > > > block
> > > > > > > > cache.
> > > > > > > >
> > > > > > > >
> > > > > > > > You can measure the context switches with vmstat. Other than
> that
> > > > I have
> > > > > > > > no suggestion until I reproduce the problem.
> > > > > > > > Also check the data locality index of the region server it
> should
> > > > be
> > > > > > > close
> > > > > > > > to 100%.
> > > > > > > >
> > > > > > > >
> > > > > > > > -- Lars
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > > Subject: RE: HBase read performance
> > > > > > > >
> > > > > > > >
> > > > > > > > Lars thanks a lot about all the tips. I'll make sure I cover
> all
> > > > of them
> > > > > > > > and get back to you. I am not sure they are the bottleneck
> though
> > > > as they
> > > > > > > > all are about optimizing physical resource usage. As I said,
> I
> > > > don't see
> > > > > > > > any contended physical resources now. I'll also try to
> reproduce
> > > > this
> > > > > > > > problem in a simpler environment and pass to you the test
> program
> > > > to play
> > > > > > > > with.
> > > > > > > >
> > > > > > > >
> > > > > > > > Couple of high level points to make. You are right that my
> use
> > > > case is
> > > > > > > > kind of a worst case for HBase reads. But, if things go the
> way
> > > you
> > > > > > > > described them, there should be tons of disk IO and that
> should
> > > be
> > > > > > > clearly
> > > > > > > > the bottleneck. This is not the case though. That's for the
> > > simple
> > > > reason
> > > > > > > > that this is done in a test environment (I am still
> prototyping),
> > > > and
> > > > > > > not a
> > > > > > > > lot of data is yet written to HBase. However for the real use
> > > > case, there
> > > > > > > > should writers constantly writing data to HBase and readers
> > > > occasionally
> > > > > > > > doing this scatter/gather. At steady state, things should
> only
> > > get
> > > > worse
> > > > > > > > and all the issues you mentioned should get far more
> pronounced.
> > > > At this
> > > > > > > > point, one can try to mitigate it using more memory or so. I
> am
> > > > not there
> > > > > > > > yet as I think I am hitting some software bottleneck, which I
> > > > don't know
> > > > > > > > how to work around.
> > > > > > > >
> > > > > > > > Khaled
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----------------------------------------
> > > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > > From: larsh@apache.org
> > > > > > > > > Subject: Re: HBase read performance
> > > > > > > > > To: user@hbase.apache.org
> > > > > > > > >
> > > > > > > > > OK... We might need to investigate this.
> > > > > > > > > Any chance that you can provide a minimal test program and
> > > > instruction
> > > > > > > > about how to set it up.
> > > > > > > > > We can do some profiling then.
> > > > > > > > >
> > > > > > > > > One thing to note is that with scanning HBase cannot use
> bloom
> > > > filters
> > > > > > > > to rule out HFiles ahead of time, it needs to look into all
> of
> > > > them.
> > > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > > - cannot use bloom filters
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Few more question/comments:
> > > > > > > > > - Do you have short circuit reading enabled? If not, you
> > > should.
> > > > > > > > > - Is your table major compacted? That will reduce the
> number of
> > > > files
> > > > > > > to
> > > > > > > > look at.
> > > > > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It's
> > > > > > > > > disabled by default in HBase, but not necessarily in your install of HDFS.
> > > > > > > > > - Which version of HDFS are you using as backing
> filesystem?
> > > > > > > > > - If your disk is idle, it means the data fits into the OS
> > > buffer
> > > > > > > cache.
> > > > > > > > In turn that means that you increase the heap for the region
> > > > servers. You
> > > > > > > > can also use block encoding (FAST_DIFF) to try to make sure
> the
> > > > entire
> > > > > > > > working set fits into the cache.
> > > > > > > > >
> > > > > > > > > - Also try to reduce the block size - although if your
> overall
> > > > working
> > > > > > > > set does not fit in the heap it won't help much.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > This is a good section of the book to read through
> generally
> > > > (even
> > > > > > > > though you might know most of this already):
> > > > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -- Lars
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > > Cc:
> > > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > > Subject: RE: HBase read performance
> > > > > > > > >
> > > > > > > > > I do see a very brief spike in CPU (user/system), but it's
> no
> > > > where
> > > > > > > near
> > > > > > > > 0% idle. It goes from 99% idle down to something like 40%
> idle
> > > for
> > > > a
> > > > > > > second
> > > > > > > > or so. The thing to note, this is all on a test cluster, so
> no
> > > > real load.
> > > > > > > > Things are generally idle until i issue 2-3 of these
> > > > multi-scan-requests
> > > > > > > to
> > > > > > > > render a web page. Then, you see the spike in the cpu and
> some
> > > > activity
> > > > > > > in
> > > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If there are specific tests you'd like me to do to debug
> this,
> > > > I'd be
> > > > > > > > more than happy to do it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Khaled
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----------------------------------------
> > > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > > >> From: larsh@apache.org
> > > > > > > > >> Subject: Re: HBase read performance
> > > > > > > > >> To: user@hbase.apache.org
> > > > > > > > >>
> > > > > > > > >> I still think you're waiting on disk. No IOWAIT? So CPU
> is not
> > > > waiting
> > > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > > >>
> > > > > > > > >> If you see a lot of evicted block then each RPC has a high
> > > > chance of
> > > > > > > > requiring to bring an entire 64k block in. You'll see bad
> > > > performance
> > > > > > > with
> > > > > > > > this.
> > > > > > > > >>
> > > > > > > > >> We might need to trace this specific scenario.
> > > > > > > > >>
> > > > > > > > >> -- Lars
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> ________________________________
> > > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > > >> Subject: RE: HBase read performance
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> I've set the heap size to 6GB and I do have gc logging. No
> > > long
> > > > pauses
> > > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > > >>
> > > > > > > > >> Other than the discrepancy between what's reported on the
> > > > client and
> > > > > > > > what's reported at the RS, there is also the issue of not
> getting
> > > > proper
> > > > > > > > concurrency. So, even if a reverse get takes 100ms or so
> (this
> > > has
> > > > to be
> > > > > > > > mostly blocking on various things as no physical resource is
> > > > contended),
> > > > > > > > then the other gets/scans should be able to proceed in
> parallel,
> > > > so a
> > > > > > > > thousand concurrent gets/scans should finish in few hundreds
> of
> > > ms
> > > > not
> > > > > > > many
> > > > > > > > seconds. That's why I thought I'd increase the handlers
> count to
> > > > try to
> > > > > > > get
> > > > > > > > more concurrency, but it didn't help. So, there must be
> something
> > > > else.
> > > > > > > > >>
> > > > > > > > >> Khaled
> > > > > > > > >>
> > > > > > > > >> ----------------------------------------
> > > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > > >>> To: user@hbase.apache.org
> > > > > > > > >>>
> > > > > > > > >>> Do check again on the heap size of the region servers.
> The
> > > > default
> > > > > > > > >>> unconfigured size is 1G; too small for much of anything.
> > > Check
> > > > your
> > > > > > > RS
> > > > > > > > logs
> > > > > > > > >>> -- look for lines produced by the JVMPauseMonitor thread.
> > > They
> > > > > > > usually
> > > > > > > > >>> correlate with long GC pauses or other process-freeze
> events.
> > > > > > > > >>>
> > > > > > > > >>> Get is implemented as a Scan of a single row, so a
> reverse
> > > > scan of a
> > > > > > > > single
> > > > > > > > >>> row should be functionally equivalent.
> > > > > > > > >>>
> > > > > > > > >>> In practice, I have seen discrepancy between the
> latencies
> > > > reported
> > > > > > > by
> > > > > > > > the
> > > > > > > > >>> RS and the latencies experienced by the client. I've not
> > > > investigated
> > > > > > > > this
> > > > > > > > >>> area thoroughly.
> > > > > > > > >>>
> > > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> > > > kdiaa@hotmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > > >>>>
> > > > > > > > >>>> Yes performance is similar with less handlers (I tried
> with
> > > > 100
> > > > > > > > first).
> > > > > > > > >>>>
> > > > > > > > >>>> The payload is not big ~1KB or so. The working set
> doesn't
> > > > seem to
> > > > > > > > fit in
> > > > > > > > >>>> memory as there are many cache misses. However, disk is
> far
> > > > from
> > > > > > > > being a
> > > > > > > > >>>> bottleneck. I checked using iostat. I also verified that
> > > > neither the
> > > > > > > > >>>> network nor the CPU of the region server or the client
> are a
> > > > > > > > bottleneck.
> > > > > > > > >>>> This leads me to believe that likely this is a software
> > > > bottleneck,
> > > > > > > > >>>> possibly due to a misconfiguration on my side. I just
> don't
> > > > know how
> > > > > > > > to
> > > > > > > > >>>> debug it. A clear disconnect I see is the individual
> request
> > > > latency
> > > > > > > > as
> > > > > > > > >>>> reported by metrics on the region server (IPC
> > > processCallTime
> > > > vs
> > > > > > > > scanNext)
> > > > > > > > >>>> vs what's measured on the client. Does this sound
> right? Any
> > > > ideas
> > > > > > > on
> > > > > > > > how
> > > > > > > > >>>> to better debug it?
> > > > > > > > >>>>
> > > > > > > > >>>> About this trick with the timestamps to be able to do a
> > > > forward
> > > > > > > scan,
> > > > > > > > >>>> thanks for pointing it out. Actually, I am aware of it.
> The
> > > > problem
> > > > > > > I
> > > > > > > > have
> > > > > > > > >>>> is, sometimes I want to get the key after a particular
> > > > timestamp and
> > > > > > > > >>>> sometimes I want to get the key before, so just relying
> on
> > > > the key
> > > > > > > > order
> > > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought
> > > > reverse
> > > > > > > scan
> > > > > > > > can
> > > > > > > > >>>> do the trick though.
> > > > > > > > >>>>
> > > > > > > > >>>> Khaled
> > > > > > > > >>>>
> > > > > > > > >>>> ----------------------------------------
> > > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > > >>>>> From: larsh@apache.org
> > > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > > >>>>>
> > > > > > > > >>>>> Hi Khaled,
> > > > > > > > >>>>> is it the same with fewer threads? 1500 handler threads
> > > > seems to
> > > > > > > be a
> > > > > > > > >>>> lot. Typically a good number of threads depends on the
> > > > hardware
> > > > > > > > (number of
> > > > > > > > >>>> cores, number of spindles, etc). I cannot think of any
> type
> > > of
> > > > > > > > scenario
> > > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > > >>>>>
> > > > > > > > >>>>> How large is the payload per KV retrieved that way? If
> > > large
> > > > (as
> > > > > > > in a
> > > > > > > > >>>> few 100k) you definitely want to lower the number of the
> > > > handler
> > > > > > > > threads.
> > > > > > > > >>>>> How much heap do you give the region server? Does the
> > > > working set
> > > > > > > fit
> > > > > > > > >>>> into the cache? (i.e. in the metrics, do you see the
> > > eviction
> > > > count
> > > > > > > > going
> > > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > > >>>>>
> > > > > > > > >>>>> If the working set does not fit into the cache
> (eviction
> > > > count goes
> > > > > > > > up)
> > > > > > > > >>>> then HBase will need to bring a new block in from disk
> on
> > > > each Get
> > > > > > > > >>>> (assuming the Gets are more or less random as far as the
> > > > server is
> > > > > > > > >>>> concerned).
> > > > > > > > >>>>> In that case you'll benefit from reducing the HFile block
> size
> > > > (from 64k
> > > > > > > > to
> > > > > > > > >>>> 8k or even 4k).
> > > > > > > > >>>>>
> > > > > > > > >>>>> Lastly I don't think we tested the performance of using
> > > > reverse
> > > > > > > scan
> > > > > > > > >>>> this way, there is probably room to optimize this.
> > > > > > > > >>>>> Can you restructure your keys to allow forwards
> scanning?
> > > For
> > > > > > > example
> > > > > > > > >>>> you could store the time as MAX_LONG-time. Or you could
> > > > invert all
> > > > > > > > the bits
> > > > > > > > >>>> of the time portion of the key, so that it sorts the
> other
> > > > way. Then
> > > > > > > > you
> > > > > > > > >>>> could do a forward scan.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Let us know how it goes.
> > > > > > > > >>>>>
> > > > > > > > >>>>> -- Lars
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> ----- Original Message -----
> > > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > >>>>> Cc:
> > > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > > >>>>> Subject: HBase read performance
> > > > > > > > >>>>>
> > > > > > > > >>>>> Hi,
> > > > > > > > >>>>>
> > > > > > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1),
> > > > where I
> > > > > > > have
> > > > > > > > a
> > > > > > > > >>>> client reading ~1000 keys from an HBase table. These
> keys
> > > > happen to
> > > > > > > > fall on
> > > > > > > > >>>> the same region server. For my reads I use reverse scan
> to
> > > > read each
> > > > > > > > key as
> > > > > > > > >>>> I want the key prior to a specific time stamp (time
> stamps
> > > are
> > > > > > > stored
> > > > > > > > in
> > > > > > > > >>>> reverse order). I don't believe gets can accomplish
> that,
> > > > right? so
> > > > > > > I
> > > > > > > > use
> > > > > > > > >>>> scan, with caching set to 1.
> > > > > > > > >>>>>
> > > > > > > > >>>>> I use 2000 reader threads in the client and on HBase,
> I've
> > > > set
> > > > > > > > >>>> hbase.regionserver.handler.count to 1500. With this
> setup,
> > > my
> > > > > > > scatter
> > > > > > > > >>>> gather is very slow and can take up to 10s in total.
> Timing
> > > an
> > > > > > > > individual
> > > > > > > > >>>> getScanner(..) call on the client side, it can easily
> take
> > > few
> > > > > > > > hundreds of
> > > > > > > > >>>> ms. I also got the following metrics from the region
> server
> > > in
> > > > > > > > question:
> > > > > > > > >>>>>
> > > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > > > > >>>>>
> > > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > > >>>> : 89,
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > > >>>> : 11300,
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > > >>>> : 654.4949739797315,
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > > >>>> : 101.0,
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > > >>>> : 101.0,
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > > >>>> : 101.0,
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > >
> > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > > >>>> : 113.0,
> > > > > > > > >>>>>
> > > > > > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > > > > > >>>>>
> > > > > > > > >>>>> In addition to all this, i monitored the hardware
> resources
> > > > (CPU,
> > > > > > > > disk,
> > > > > > > > >>>> and network) of both the client and the region server
> and
> > > > nothing
> > > > > > > > seems
> > > > > > > > >>>> anywhere near saturation. So I am puzzled by what's
> going on
> > > > and
> > > > > > > > where this
> > > > > > > > >>>> time is going.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Few things to note based on the above measurements:
> both
> > > > medians of
> > > > > > > > IPC
> > > > > > > > >>>> processCallTime and queueCallTime are basically zero
> (ms I
> > > > presume,
> > > > > > > > >>>> right?). However, scanNext_median is 101 (ms too,
> right?). I
> > > > am not
> > > > > > > > sure
> > > > > > > > >>>> how this adds up. Also, even though the 101 figure seems
> > > > > > > outrageously
> > > > > > > > high
> > > > > > > > >>>> and I don't know why, still all these scans should be
> > > > happening in
> > > > > > > > >>>> parallel, so the overall call should finish fast, given
> that
> > > > no
> > > > > > > > hardware
> > > > > > > > >>>> resource is contended, right? but this is not what's
> > > > happening, so I
> > > > > > > > have
> > > > > > > > >>>> to be missing something(s).
> > > > > > > > >>>>>
> > > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Thanks,
> > > > > > > > >>>>> Khaled
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > > >
> > >
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Thanks Nicolas, Qiang.

I was able to write a simple program that reproduces the problem on a tiny HBase cluster on EC2. The cluster has 2 m1.large nodes. One node runs the master, name node and zookeeper. The other node runs a data node and a region server, with heap size configured to be 6GB. There, the 1000 parallel reverse gets (reverse scans) take 7-8 seconds. The data set is tiny (10M records, each having a small number of bytes). As I said before, all hardware resources are very idle there.

Interestingly, running the same workload on my MacBook, the 1000 parallel gets take ~200ms on a pseudo-distributed installation.

Any help to resolve this mystery is highly appreciated.

P.S. please find my test program attached.
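
In case the attachment does not come through, the gist of the test is sketched below. This is only an illustrative fragment, not the attached program itself: the key layout ("key-" + i) is made up, while the "delta" table, the single shared HConnection, and the 1000 parallel reverse single-row scans with caching set to 1 follow the setup described above. The setSmall(true) call is the optional tweak Nicolas suggested.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelReverseScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // One shared connection for all threads, as in the setup described above.
    final HConnection conn = HConnectionManager.createConnection(conf);
    ExecutorService pool = Executors.newFixedThreadPool(1000); // one thread per lookup
    final CountDownLatch done = new CountDownLatch(1000);
    long start = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) {
      // Hypothetical key layout; the real keys embed a reversed timestamp.
      final byte[] startRow = Bytes.toBytes("key-" + i);
      pool.submit(new Runnable() {
        public void run() {
          try {
            HTableInterface table = conn.getTable("delta");
            Scan scan = new Scan(startRow);
            scan.setReversed(true); // want the first row at or before startRow
            scan.setCaching(1);     // only one row is needed per lookup
            scan.setSmall(true);    // Nicolas' suggestion: saves RPCs for single-row scans
            ResultScanner scanner = table.getScanner(scan);
            Result row = scanner.next(); // the single "reverse get" result
            scanner.close();
            table.close();
          } catch (Exception e) {
            e.printStackTrace();
          } finally {
            done.countDown();
          }
        }
      });
    }
    done.await();
    System.out.println("1000 reverse gets took "
        + (System.currentTimeMillis() - start) + " ms");
    pool.shutdown();
    conn.close();
  }
}
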
Best,
Khaled

> From: nkeywal@gmail.com
> Date: Mon, 6 Oct 2014 09:40:48 +0200
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
> 
> Hi,
> 
> I haven't seen it mentioned, but if I understand correctly each scan
> returns a single row? If so you should use Scan#setSmall to save some rpc
> calls.
> 
> Cheers,
> 
> Nicolas
> 
> 
> On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com> wrote:
> 
> > when using separate HConnection instance, both its
> > RpcClient instance(maintain connection to a  regionserver) and Registry
> > instance(maintain  connection to zookeeper) will be separate..
> >
> > see
> >
> > http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
> >
> >
> >
> >
> >
> > On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com>
> > wrote:
> >
> > > I tried creating my own HConnections pool to use for my HBase calls, so
> > > that not all the (2K) threads share the same HConnection. However, I
> > could
> > > only have 10 HConnections. Beyond that I get ZK exceptions, please find
> > it
> > > below. Also, with 10 HConnections, I don't see noticeable improvement in
> > > performance so far.
> > > 2014-10-05 06:11:26,490 WARN  [main] zookeeper.RecoverableZooKeeper
> > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > ZooKeeper, quorum=54.68.206.252:2181,
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > 06:11:26,490
> > > INFO  [main] util.RetryCounter
> > (RetryCounter.java:sleepUntilNextRetry(155))
> > > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN  [main]
> > > zookeeper.RecoverableZooKeeper
> > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > ZooKeeper, quorum=54.68.206.252:2181,
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > 06:11:27,849
> > > INFO  [main] util.RetryCounter
> > (RetryCounter.java:sleepUntilNextRetry(155))
> > > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN  [main]
> > > zookeeper.RecoverableZooKeeper
> > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > ZooKeeper, quorum=54.68.206.252:2181,
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > 06:11:30,405
> > > INFO  [main] util.RetryCounter
> > (RetryCounter.java:sleepUntilNextRetry(155))
> > > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN  [main]
> > > zookeeper.RecoverableZooKeeper
> > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > ZooKeeper, quorum=54.68.206.252:2181,
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > 06:11:35,279
> > > INFO  [main] util.RetryCounter
> > (RetryCounter.java:sleepUntilNextRetry(155))
> > > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN  [main]
> > > zookeeper.RecoverableZooKeeper
> > > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > > ZooKeeper, quorum=54.68.206.252:2181,
> > > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> > 06:11:44,393
> > > ERROR [main] zookeeper.RecoverableZooKeeper
> > > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed
> > > after 4 attempts2014-10-05 06:11:44,394 WARN  [main] zookeeper.ZKUtil
> > > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> > > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode
> > >
> > (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at
> > > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)   at
> > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > > at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> > >     at
> > >
> > org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> > >       at
> > >
> > org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> > >    at
> > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> > >  at
> > >
> > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > Method)        at
> > >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > at
> > >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > at
> > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> > >     at
> > >
> > org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > > From: kdiaa@hotmail.com
> > > > To: user@hbase.apache.org
> > > > Subject: RE: HBase read performance
> > > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > > >
> > > > Lars, Ted, and Qiang,
> > > > Thanks for all the input.
> > > > Qiang: yes all the threads are in the same client process sharing the
> > > same connection. And since I don't see hardware contention, maybe there
> > is
> > > contention over this code path. I'll try using many connections and see
> > if
> > > it alleviates the problems and I'll report back.
> > > > Thanks again,Khaled
> > > >
> > > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > > Subject: Re: HBase read performance
> > > > > From: tianq01@gmail.com
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > Regarding profiling, Andrew introduced
> > > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > > months
> > > > > ago.
> > > > >
> > > > > processCallTime comes from RpcServer#call, so it looks good?
> > > > >
> > > > > I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306
> > > > >
> > > > > how many processes do you have for your 2000 threads?
> > > > > if only 1 process, those threads will share just 1 connection to that
> > > > > regionserver, there might be big contention on the RPC code path.
> > > ---for
> > > > > such case, could you try using different connections?
> > > > >
> > >
> > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com> wrote:
> > > > >
> > > > > > Khaled:
> > > > > > Do you have profiler such as jprofiler ?
> > > > > > Profiling would give us more hint.
> > > > > >
> > > > > > Otherwise capturing stack trace during the period of reverse scan
> > > would
> > > > > > help.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > You might have the data in the OS buffer cache, without short
> > > circuit
> > > > > > > reading the region server has to request the block from the data
> > > node
> > > > > > > process, which then reads it from the block cache.
> > > > > > > That is a few context switches per RPC that do not show up in CPU
> > > > > > metrics.
> > > > > > > In that you also would not see disk IO.
> > > > > > >
> > > > > > >
> > > > > > > If - as you say - you see a lot of evicted blocks the data *has*
> > > to come
> > > > > > > from the OS. If you do not see disk IO it *has* to come from the
> > OS
> > > > > > cache.
> > > > > > > I.e. there's more RAM on your boxes, and you should increase the
> > > heap
> > > > > > block
> > > > > > > cache.
> > > > > > >
> > > > > > >
> > > > > > > You can measure the context switches with vmstat. Other than that
> > > I have
> > > > > > > no suggestion until I reproduce the problem.
> > > > > > > Also check the data locality index of the region server it should
> > > be
> > > > > > close
> > > > > > > to 100%.
> > > > > > >
> > > > > > >
> > > > > > > -- Lars
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ________________________________
> > > > > > >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > > Subject: RE: HBase read performance
> > > > > > >
> > > > > > >
> > > > > > > Lars thanks a lot about all the tips. I'll make sure I cover all
> > > of them
> > > > > > > and get back to you. I am not sure they are the bottleneck though
> > > as they
> > > > > > > all are about optimizing physical resource usage. As I said, I
> > > don't see
> > > > > > > any contended physical resources now. I'll also try to reproduce
> > > this
> > > > > > > problem in a simpler environment and pass to you the test program
> > > to play
> > > > > > > with.
> > > > > > >
> > > > > > >
> > > > > > > Couple of high level points to make. You are right that my use
> > > case is
> > > > > > > kind of a worst case for HBase reads. But, if things go the way
> > you
> > > > > > > described them, there should be tons of disk IO and that should
> > be
> > > > > > clearly
> > > > > > > the bottleneck. This is not the case though. That's for the
> > simple
> > > reason
> > > > > > > that this is done in a test environment (I am still prototyping),
> > > and
> > > > > > not a
> > > > > > > lot of data is yet written to HBase. However for the real use
> > > case, there
> > > > > > > should be writers constantly writing data to HBase and readers
> > > occasionally
> > > > > > > doing this scatter/gather. At steady state, things should only
> > get
> > > worse
> > > > > > > and all the issues you mentioned should get far more pronounced.
> > > At this
> > > > > > > point, one can try to mitigate it using more memory or so. I am
> > > not there
> > > > > > > yet as I think I am hitting some software bottleneck, which I
> > > don't know
> > > > > > > how to work around.
> > > > > > >
> > > > > > > Khaled
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----------------------------------------
> > > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > > From: larsh@apache.org
> > > > > > > > Subject: Re: HBase read performance
> > > > > > > > To: user@hbase.apache.org
> > > > > > > >
> > > > > > > > OK... We might need to investigate this.
> > > > > > > > Any chance that you can provide a minimal test program and
> > > instruction
> > > > > > > about how to set it up.
> > > > > > > > We can do some profiling then.
> > > > > > > >
> > > > > > > > One thing to note is that with scanning HBase cannot use bloom
> > > filters
> > > > > > > to rule out HFiles ahead of time, it needs to look into all of
> > > them.
> > > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > > - random reads that do not fit into the block cache
> > > > > > > > - cannot use bloom filters
> > > > > > > >
> > > > > > > >
> > > > > > > > Few more question/comments:
> > > > > > > > - Do you have short circuit reading enabled? If not, you
> > should.
> > > > > > > > - Is your table major compacted? That will reduce the number of
> > > files
> > > > > > to
> > > > > > > look at.
> > > > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It
> > > disabled
> > > > > > > by default in HBase, but not necessarily in your install of HDFS.
> > > > > > > > - Which version of HDFS are you using as backing filesystem?
> > > > > > > > - If your disk is idle, it means the data fits into the OS
> > buffer
> > > > > > cache.
> > > > > > > In turn that means that you increase the heap for the region
> > > servers. You
> > > > > > > can also use block encoding (FAST_DIFF) to try to make sure the
> > > entire
> > > > > > > working set fits into the cache.
> > > > > > > >
> > > > > > > > - Also try to reduce the block size - although if your overall
> > > working
> > > > > > > set does not fit in the heap it won't help much.
> > > > > > > >
> > > > > > > >
> > > > > > > > This is a good section of the book to read through generally
> > > (even
> > > > > > > though you might know most of this already):
> > > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > > >
> > > > > > > >
> > > > > > > > -- Lars
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > > Cc:
> > > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > > Subject: RE: HBase read performance
> > > > > > > >
> > > > > > > > I do see a very brief spike in CPU (user/system), but it's no
> > > where
> > > > > > near
> > > > > > > 0% idle. It goes from 99% idle down to something like 40% idle
> > for
> > > a
> > > > > > second
> > > > > > > or so. The thing to note, this is all on a test cluster, so no
> > > real load.
> > > > > > > Things are generally idle until i issue 2-3 of these
> > > multi-scan-requests
> > > > > > to
> > > > > > > render a web page. Then, you see the spike in the cpu and some
> > > activity
> > > > > > in
> > > > > > > the network and disk, but nowhere near saturation.
> > > > > > > >
> > > > > > > >
> > > > > > > > If there are specific tests you'd like me to do to debug this,
> > > I'd be
> > > > > > > more than happy to do it.
> > > > > > > >
> > > > > > > >
> > > > > > > > Khaled
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----------------------------------------
> > > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > > >> From: larsh@apache.org
> > > > > > > >> Subject: Re: HBase read performance
> > > > > > > >> To: user@hbase.apache.org
> > > > > > > >>
> > > > > > > >> I still think you're waiting on disk. No IOWAIT? So CPU is not
> > > waiting
> > > > > > > a lot for IO. No high User/System CPU either?
> > > > > > > >>
> > > > > > > >> If you see a lot of evicted block then each RPC has a high
> > > chance of
> > > > > > > requiring to bring an entire 64k block in. You'll see bad
> > > performance
> > > > > > with
> > > > > > > this.
> > > > > > > >>
> > > > > > > >> We might need to trace this specific scenario.
> > > > > > > >>
> > > > > > > >> -- Lars
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> ________________________________
> > > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > > >> Subject: RE: HBase read performance
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> I've set the heap size to 6GB and I do have gc logging. No
> > long
> > > pauses
> > > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > > >>
> > > > > > > >> Other than the discrepancy between what's reported on the
> > > client and
> > > > > > > what's reported at the RS, there is also the issue of not getting
> > > proper
> > > > > > > concurrency. So, even if a reverse get takes 100ms or so (this
> > has
> > > to be
> > > > > > > mostly blocking on various things as no physical resource is
> > > contended),
> > > > > > > then the other gets/scans should be able to proceed in parallel,
> > > so a
> > > > > > > thousand concurrent gets/scans should finish in few hundreds of
> > ms
> > > not
> > > > > > many
> > > > > > > seconds. That's why I thought I'd increase the handlers count to
> > > try to
> > > > > > get
> > > > > > > more concurrency, but it didn't help. So, there must be something
> > > else.
> > > > > > > >>
> > > > > > > >> Khaled
> > > > > > > >>
> > > > > > > >> ----------------------------------------
> > > > > > > >>> From: ndimiduk@gmail.com
> > > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > > >>> Subject: Re: HBase read performance
> > > > > > > >>> To: user@hbase.apache.org
> > > > > > > >>>
> > > > > > > >>> Do check again on the heap size of the region servers. The
> > > default
> > > > > > > >>> unconfigured size is 1G; too small for much of anything.
> > Check
> > > your
> > > > > > RS
> > > > > > > logs
> > > > > > > >>> -- look for lines produced by the JVMPauseMonitor thread.
> > They
> > > > > > usually
> > > > > > > >>> correlate with long GC pauses or other process-freeze events.
> > > > > > > >>>
> > > > > > > >>> Get is implemented as a Scan of a single row, so a reverse
> > > scan of a
> > > > > > > single
> > > > > > > >>> row should be functionally equivalent.
> > > > > > > >>>
> > > > > > > >>> In practice, I have seen discrepancy between the latencies
> > > reported
> > > > > > by
> > > > > > > the
> > > > > > > >>> RS and the latencies experienced by the client. I've not
> > > investigated
> > > > > > > this
> > > > > > > >>> area thoroughly.
> > > > > > > >>>
> > > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> > > kdiaa@hotmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > > >>>>
> > > > > > > >>>> Yes performance is similar with less handlers (I tried with
> > > 100
> > > > > > > first).
> > > > > > > >>>>
> > > > > > > >>>> The payload is not big ~1KB or so. The working set doesn't
> > > seem to
> > > > > > > fit in
> > > > > > > >>>> memory as there are many cache misses. However, disk is far
> > > from
> > > > > > > being a
> > > > > > > >>>> bottleneck. I checked using iostat. I also verified that
> > > neither the
> > > > > > > >>>> network nor the CPU of the region server or the client are a
> > > > > > > bottleneck.
> > > > > > > >>>> This leads me to believe that likely this is a software
> > > bottleneck,
> > > > > > > >>>> possibly due to a misconfiguration on my side. I just don't
> > > know how
> > > > > > > to
> > > > > > > >>>> debug it. A clear disconnect I see is the individual request
> > > latency
> > > > > > > as
> > > > > > > >>>> reported by metrics on the region server (IPC
> > processCallTime
> > > vs
> > > > > > > scanNext)
> > > > > > > >>>> vs what's measured on the client. Does this sound right? Any
> > > ideas
> > > > > > on
> > > > > > > how
> > > > > > > >>>> to better debug it?
> > > > > > > >>>>
> > > > > > > >>>> About this trick with the timestamps to be able to do a
> > > forward
> > > > > > scan,
> > > > > > > >>>> thanks for pointing it out. Actually, I am aware of it. The
> > > problem
> > > > > > I
> > > > > > > have
> > > > > > > >>>> is, sometimes I want to get the key after a particular
> > > timestamp and
> > > > > > > >>>> sometimes I want to get the key before, so just relying on
> > > the key
> > > > > > > order
> > > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought
> > > reverse
> > > > > > scan
> > > > > > > can
> > > > > > > >>>> do the trick though.
> > > > > > > >>>>
> > > > > > > >>>> Khaled
> > > > > > > >>>>
> > > > > > > >>>> ----------------------------------------
> > > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > > >>>>> From: larsh@apache.org
> > > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > > >>>>> To: user@hbase.apache.org
> > > > > > > >>>>>
> > > > > > > >>>>> Hi Khaled,
> > > > > > > >>>>> is it the same with fewer threads? 1500 handler threads
> > > seems to
> > > > > > be a
> > > > > > > >>>> lot. Typically a good number of threads depends on the
> > > hardware
> > > > > > > (number of
> > > > > > > >>>> cores, number of spindles, etc). I cannot think of any type
> > of
> > > > > > > scenario
> > > > > > > >>>> where more than 100 would give any improvement.
> > > > > > > >>>>>
> > > > > > > >>>>> How large is the payload per KV retrieved that way? If
> > large
> > > (as
> > > > > > in a
> > > > > > > >>>> few 100k) you definitely want to lower the number of the
> > > handler
> > > > > > > threads.
> > > > > > > >>>>> How much heap do you give the region server? Does the
> > > working set
> > > > > > fit
> > > > > > > >>>> into the cache? (i.e. in the metrics, do you see the
> > eviction
> > > count
> > > > > > > going
> > > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > > >>>>>
> > > > > > > >>>>> If the working set does not fit into the cache (eviction
> > > count goes
> > > > > > > up)
> > > > > > > >>>> then HBase will need to bring a new block in from disk on
> > > each Get
> > > > > > > >>>> (assuming the Gets are more or less random as far as the
> > > server is
> > > > > > > >>>> concerned).
> > > > > > > >>>>> In that case you'll benefit from reducing the HFile block size
> > > (from 64k
> > > > > > > to
> > > > > > > >>>> 8k or even 4k).
> > > > > > > >>>>>
> > > > > > > >>>>> Lastly I don't think we tested the performance of using
> > > reverse
> > > > > > scan
> > > > > > > >>>> this way, there is probably room to optimize this.
> > > > > > > >>>>> Can you restructure your keys to allow forwards scanning?
> > For
> > > > > > example
> > > > > > > >>>> you could store the time as MAX_LONG-time. Or you could
> > > invert all
> > > > > > > the bits
> > > > > > > >>>> of the time portion of the key, so that it sort the other
> > > way. Then
> > > > > > > you
> > > > > > > >>>> could do a forward scan.
> > > > > > > >>>>>
> > > > > > > >>>>> Let us know how it goes.
> > > > > > > >>>>>
> > > > > > > >>>>> -- Lars
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>> ----- Original Message -----
> > > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > >>>>> Cc:
> > > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > > >>>>> Subject: HBase read performance
> > > > > > > >>>>>
> > > > > > > >>>>> Hi,
> > > > > > > >>>>>
> > > > > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1),
> > > where I
> > > > > > have
> > > > > > > a
> > > > > > > >>>> client reading ~1000 keys from an HBase table. These keys
> > > happen to
> > > > > > > fall on
> > > > > > > >>>> the same region server. For my reads I use reverse scan to
> > > read each
> > > > > > > key as
> > > > > > > >>>> I want the key prior to a specific time stamp (time stamps
> > are
> > > > > > stored
> > > > > > > in
> > > > > > > >>>> reverse order). I don't believe gets can accomplish that,
> > > right? so
> > > > > > I
> > > > > > > use
> > > > > > > >>>> scan, with caching set to 1.
> > > > > > > >>>>>
> > > > > > > >>>>> I use 2000 reader threads in the client and on HBase, I've
> > > set
> > > > > > > >>>> hbase.regionserver.handler.count to 1500. With this setup,
> > my
> > > > > > scatter
> > > > > > > >>>> gather is very slow and can take up to 10s in total. Timing
> > an
> > > > > > > individual
> > > > > > > >>>> getScanner(..) call on the client side, it can easily take
> > few
> > > > > > > hundreds of
> > > > > > > >>>> ms. I also got the following metrics from the region server
> > in
> > > > > > > question:
> > > > > > > >>>>>
> > > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > > > >>>>>
> > > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > > >>>> : 89,
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > > >>>> : 11300,
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > > >>>> : 654.4949739797315,
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > > >>>> : 101.0,
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > > >>>> : 101.0,
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > > >>>> : 101.0,
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > > >>>> : 113.0,
> > > > > > > >>>>>
> > > > > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > > > > >>>>>
> > > > > > > >>>>> In addition to all this, i monitored the hardware resources
> > > (CPU,
> > > > > > > disk,
> > > > > > > >>>> and network) of both the client and the region server and
> > > nothing
> > > > > > > seems
> > > > > > > >>>> anywhere near saturation. So I am puzzled by what's going on
> > > and
> > > > > > > where this
> > > > > > > >>>> time is going.
> > > > > > > >>>>>
> > > > > > > >>>>> Few things to note based on the above measurements: both
> > > medians of
> > > > > > > IPC
> > > > > > > >>>> processCallTime and queueCallTime are basically zero (ms I
> > > presume,
> > > > > > > >>>> right?). However, scanNext_median is 101 (ms too, right?). I
> > > am not
> > > > > > > sure
> > > > > > > >>>> how this adds up. Also, even though the 101 figure seems
> > > > > > outrageously
> > > > > > > high
> > > > > > > >>>> and I don't know why, still all these scans should be
> > > happening in
> > > > > > > >>>> parallel, so the overall call should finish fast, given that
> > > no
> > > > > > > hardware
> > > > > > > >>>> resource is contended, right? but this is not what's
> > > happening, so I
> > > > > > > have
> > > > > > > >>>> to be missing something(s).
> > > > > > > >>>>>
> > > > > > > >>>>> So, any help is appreciated there.
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks,
> > > > > > > >>>>> Khaled
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > >
> > > > > >
> > > >
> > >
> > >
> >
 		 	   		  

Re: HBase read performance

Posted by Nicolas Liochon <nk...@gmail.com>.
Hi,

I haven't seen it mentioned, but if I understand correctly each scan
returns a single row? If so you should use Scan#setSmall to save some rpc
calls.
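
For example, each reverse point lookup could be issued as a small scan along
these lines (a minimal sketch against the 0.98 client API; "table" is assumed
to be an already-open HTableInterface):

// Reverse "get": the newest row at or before startRow, done as a small scan
// so the open/next/close happens in a single RPC instead of separate calls.
static Result reverseGet(HTableInterface table, byte[] startRow) throws IOException {
  Scan scan = new Scan(startRow);
  scan.setReversed(true);
  scan.setSmall(true);
  scan.setCaching(1);
  ResultScanner scanner = table.getScanner(scan);
  try {
    return scanner.next();   // the single row, or null if nothing sorts at/before startRow
  } finally {
    scanner.close();
  }
}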

Cheers,

Nicolas


On Sun, Oct 5, 2014 at 11:28 AM, Qiang Tian <ti...@gmail.com> wrote:

> when using separate HConnection instance, both its
> RpcClient instance(maintain connection to a  regionserver) and Registry
> instance(maintain  connection to zookeeper) will be separate..
>
> see
>
> http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
>
>
>
>
>
> On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
>
> > I tried creating my own HConnections pool to use for my HBase calls, so
> > that not all the (2K) threads share the same HConnection. However, I
> could
> > only have 10 HConnections. Beyond that I get ZK exceptions, please find
> it
> > below. Also, with 10 HConnections, I don't see noticeable improvement in
> > performance so far.
> > 2014-10-05 06:11:26,490 WARN  [main] zookeeper.RecoverableZooKeeper
> > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > ZooKeeper, quorum=54.68.206.252:2181,
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> 06:11:26,490
> > INFO  [main] util.RetryCounter
> (RetryCounter.java:sleepUntilNextRetry(155))
> > - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN  [main]
> > zookeeper.RecoverableZooKeeper
> > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > ZooKeeper, quorum=54.68.206.252:2181,
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> 06:11:27,849
> > INFO  [main] util.RetryCounter
> (RetryCounter.java:sleepUntilNextRetry(155))
> > - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN  [main]
> > zookeeper.RecoverableZooKeeper
> > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > ZooKeeper, quorum=54.68.206.252:2181,
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> 06:11:30,405
> > INFO  [main] util.RetryCounter
> (RetryCounter.java:sleepUntilNextRetry(155))
> > - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN  [main]
> > zookeeper.RecoverableZooKeeper
> > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > ZooKeeper, quorum=54.68.206.252:2181,
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> 06:11:35,279
> > INFO  [main] util.RetryCounter
> (RetryCounter.java:sleepUntilNextRetry(155))
> > - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN  [main]
> > zookeeper.RecoverableZooKeeper
> > (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> > ZooKeeper, quorum=54.68.206.252:2181,
> > exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05
> 06:11:44,393
> > ERROR [main] zookeeper.RecoverableZooKeeper
> > (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed
> > after 4 attempts2014-10-05 06:11:44,394 WARN  [main] zookeeper.ZKUtil
> > (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> > 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode
> >
> (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at
> > org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)   at
> >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> > at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
> >     at
> >
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
> >       at
> >
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
> >    at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
> >  at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
> >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)        at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > at
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
> >     at
> >
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > > From: kdiaa@hotmail.com
> > > To: user@hbase.apache.org
> > > Subject: RE: HBase read performance
> > > Date: Fri, 3 Oct 2014 12:37:39 -0700
> > >
> > > Lars, Ted, and Qiang,
> > > Thanks for all the input.
> > > Qiang: yes all the threads are in the same client process sharing the
> > same connection. And since I don't see hardware contention, maybe there
> is
> > contention over this code path. I'll try using many connections and see
> if
> > it alleviates the problems and I'll report back.
> > > Thanks again,Khaled
> > >
> > > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > > Subject: Re: HBase read performance
> > > > From: tianq01@gmail.com
> > > > To: user@hbase.apache.org
> > > >
> > > > Regarding profiling, Andrew introduced
> > > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> > months
> > > > ago.
> > > >
> > > > processCallTime comes from RpcServer#call, so it looks good?
> > > >
> > > > I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306
> > > >
> > > > how many processes do you have for your 2000 threads?
> > > > if only 1 process, those threads will share just 1 connection to that
> > > > regionserver, there might be big contention on the RPC code path.
> > ---for
> > > > such case, could you try using different connections?
> > > >
> >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > > Khaled:
> > > > > Do you have profiler such as jprofiler ?
> > > > > Profiling would give us more hint.
> > > > >
> > > > > Otherwise capturing stack trace during the period of reverse scan
> > would
> > > > > help.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > > > >
> > > > > > You might have the data in the OS buffer cache, without short
> > circuit
> > > > > > reading the region server has to request the block from the data
> > node
> > > > > > process, which then reads it from the block cache.
> > > > > > That is a few context switches per RPC that do not show up in CPU
> > > > > metrics.
> > > > > > In that you also would not see disk IO.
> > > > > >
> > > > > >
> > > > > > If - as you say - you see a lot of evicted blocks the data *has*
> > to come
> > > > > > from the OS. If you do not see disk IO it *has* to come from the
> OS
> > > > > cache.
> > > > > > I.e. there's more RAM on your boxes, and you should increase the
> > heap
> > > > > block
> > > > > > cache.
> > > > > >
> > > > > >
> > > > > > You can measure the context switches with vmstat. Other than that
> > I have
> > > > > > no suggestion until I reproduce the problem.
> > > > > > Also check the data locality index of the region server it should
> > be
> > > > > close
> > > > > > to 100%.
> > > > > >
> > > > > >
> > > > > > -- Lars
> > > > > >
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > > Subject: RE: HBase read performance
> > > > > >
> > > > > >
> > > > > > Lars thanks a lot about all the tips. I'll make sure I cover all
> > of them
> > > > > > and get back to you. I am not sure they are the bottleneck though
> > as they
> > > > > > all are about optimizing physical resource usage. As I said, I
> > don't see
> > > > > > any contended physical resources now. I'll also try to reproduce
> > this
> > > > > > problem in a simpler environment and pass to you the test program
> > to play
> > > > > > with.
> > > > > >
> > > > > >
> > > > > > Couple of high level points to make. You are right that my use
> > case is
> > > > > > kind of a worst case for HBase reads. But, if things go the way
> you
> > > > > > described them, there should be tons of disk IO and that should
> be
> > > > > clearly
> > > > > > the bottleneck. This is not the case though. That's for the
> simple
> > reason
> > > > > > that this is done in a test environment (I am still prototyping),
> > and
> > > > > not a
> > > > > > lot of data is yet written to HBase. However for the real use
> > case, there
> > > > > > should be writers constantly writing data to HBase and readers
> > occasionally
> > > > > > doing this scatter/gather. At steady state, things should only
> get
> > worse
> > > > > > and all the issues you mentioned should get far more pronounced.
> > At this
> > > > > > point, one can try to mitigate it using more memory or so. I am
> > not there
> > > > > > yet as I think I am hitting some software bottleneck, which I
> > don't know
> > > > > > how to work around.
> > > > > >
> > > > > > Khaled
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----------------------------------------
> > > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > > From: larsh@apache.org
> > > > > > > Subject: Re: HBase read performance
> > > > > > > To: user@hbase.apache.org
> > > > > > >
> > > > > > > OK... We might need to investigate this.
> > > > > > > Any chance that you can provide a minimal test program and
> > instruction
> > > > > > about how to set it up.
> > > > > > > We can do some profiling then.
> > > > > > >
> > > > > > > One thing to note is that with scanning HBase cannot use bloom
> > filters
> > > > > > to rule out HFiles ahead of time, it needs to look into all of
> > them.
> > > > > > > So you kind of hit on the absolute worst case:
> > > > > > > - random reads that do not fit into the block cache
> > > > > > > - cannot use bloom filters
> > > > > > >
> > > > > > >
> > > > > > > Few more question/comments:
> > > > > > > - Do you have short circuit reading enabled? If not, you
> should.
> > > > > > > - Is your table major compacted? That will reduce the number of
> > files
> > > > > to
> > > > > > look at.
> > > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It
> > disabled
> > > > > > by default in HBase, but not necessarily in your install of HDFS.
> > > > > > > - Which version of HDFS are you using as backing filesystem?
> > > > > > > - If your disk is idle, it means the data fits into the OS
> buffer
> > > > > cache.
> > > > > > In turn that means that you increase the heap for the region
> > servers. You
> > > > > > can also use block encoding (FAST_DIFF) to try to make sure the
> > entire
> > > > > > working set fits into the cache.
> > > > > > >
> > > > > > > - Also try to reduce the block size - although if your overall
> > working
> > > > > > set does not fit in the heap it won't help much.
> > > > > > >
> > > > > > >
> > > > > > > This is a good section of the book to read through generally
> > (even
> > > > > > though you might know most of this already):
> > > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > > >
> > > > > > >
> > > > > > > -- Lars
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > > Cc:
> > > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > > Subject: RE: HBase read performance
> > > > > > >
> > > > > > > I do see a very brief spike in CPU (user/system), but it's no
> > where
> > > > > near
> > > > > > 0% idle. It goes from 99% idle down to something like 40% idle
> for
> > a
> > > > > second
> > > > > > or so. The thing to note, this is all on a test cluster, so no
> > real load.
> > > > > > Things are generally idle until i issue 2-3 of these
> > multi-scan-requests
> > > > > to
> > > > > > render a web page. Then, you see the spike in the cpu and some
> > activity
> > > > > in
> > > > > > the network and disk, but nowhere near saturation.
> > > > > > >
> > > > > > >
> > > > > > > If there are specific tests you'd like me to do to debug this,
> > I'd be
> > > > > > more than happy to do it.
> > > > > > >
> > > > > > >
> > > > > > > Khaled
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----------------------------------------
> > > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > > >> From: larsh@apache.org
> > > > > > >> Subject: Re: HBase read performance
> > > > > > >> To: user@hbase.apache.org
> > > > > > >>
> > > > > > >> I still think you're waiting on disk. No IOWAIT? So CPU is not
> > waiting
> > > > > > a lot for IO. No high User/System CPU either?
> > > > > > >>
> > > > > > >> If you see a lot of evicted block then each RPC has a high
> > chance of
> > > > > > requiring to bring an entire 64k block in. You'll see bad
> > performance
> > > > > with
> > > > > > this.
> > > > > > >>
> > > > > > >> We might need to trace this specific scenario.
> > > > > > >>
> > > > > > >> -- Lars
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> ________________________________
> > > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > > >> Subject: RE: HBase read performance
> > > > > > >>
> > > > > > >>
> > > > > > >> I've set the heap size to 6GB and I do have gc logging. No
> long
> > pauses
> > > > > > there -- occasional 0.1s or 0.2s.
> > > > > > >>
> > > > > > >> Other than the discrepancy between what's reported on the
> > client and
> > > > > > what's reported at the RS, there is also the issue of not getting
> > proper
> > > > > > concurrency. So, even if a reverse get takes 100ms or so (this
> has
> > to be
> > > > > > mostly blocking on various things as no physical resource is
> > contended),
> > > > > > then the other gets/scans should be able to proceed in parallel,
> > so a
> > > > > > thousand concurrent gets/scans should finish in few hundreds of
> ms
> > not
> > > > > many
> > > > > > seconds. That's why I thought I'd increase the handlers count to
> > try to
> > > > > get
> > > > > > more concurrency, but it didn't help. So, there must be something
> > else.
> > > > > > >>
> > > > > > >> Khaled
> > > > > > >>
> > > > > > >> ----------------------------------------
> > > > > > >>> From: ndimiduk@gmail.com
> > > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > > >>> Subject: Re: HBase read performance
> > > > > > >>> To: user@hbase.apache.org
> > > > > > >>>
> > > > > > >>> Do check again on the heap size of the region servers. The
> > default
> > > > > > >>> unconfigured size is 1G; too small for much of anything.
> Check
> > your
> > > > > RS
> > > > > > logs
> > > > > > >>> -- look for lines produced by the JVMPauseMonitor thread.
> They
> > > > > usually
> > > > > > >>> correlate with long GC pauses or other process-freeze events.
> > > > > > >>>
> > > > > > >>> Get is implemented as a Scan of a single row, so a reverse
> > scan of a
> > > > > > single
> > > > > > >>> row should be functionally equivalent.
> > > > > > >>>
> > > > > > >>> In practice, I have seen discrepancy between the latencies
> > reported
> > > > > by
> > > > > > the
> > > > > > >>> RS and the latencies experienced by the client. I've not
> > investigated
> > > > > > this
> > > > > > >>> area thoroughly.
> > > > > > >>>
> > > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> > kdiaa@hotmail.com
> > > > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Thanks Lars for your quick reply.
> > > > > > >>>>
> > > > > > >>>> Yes performance is similar with less handlers (I tried with
> > 100
> > > > > > first).
> > > > > > >>>>
> > > > > > >>>> The payload is not big ~1KB or so. The working set doesn't
> > seem to
> > > > > > fit in
> > > > > > >>>> memory as there are many cache misses. However, disk is far
> > from
> > > > > > being a
> > > > > > >>>> bottleneck. I checked using iostat. I also verified that
> > neither the
> > > > > > >>>> network nor the CPU of the region server or the client are a
> > > > > > bottleneck.
> > > > > > >>>> This leads me to believe that likely this is a software
> > bottleneck,
> > > > > > >>>> possibly due to a misconfiguration on my side. I just don't
> > know how
> > > > > > to
> > > > > > >>>> debug it. A clear disconnect I see is the individual request
> > latency
> > > > > > as
> > > > > > >>>> reported by metrics on the region server (IPC
> processCallTime
> > vs
> > > > > > scanNext)
> > > > > > >>>> vs what's measured on the client. Does this sound right? Any
> > ideas
> > > > > on
> > > > > > how
> > > > > > >>>> to better debug it?
> > > > > > >>>>
> > > > > > >>>> About this trick with the timestamps to be able to do a
> > forward
> > > > > scan,
> > > > > > >>>> thanks for pointing it out. Actually, I am aware of it. The
> > problem
> > > > > I
> > > > > > have
> > > > > > >>>> is, sometimes I want to get the key after a particular
> > timestamp and
> > > > > > >>>> sometimes I want to get the key before, so just relying on
> > the key
> > > > > > order
> > > > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought
> > reverse
> > > > > scan
> > > > > > can
> > > > > > >>>> do the trick though.
> > > > > > >>>>
> > > > > > >>>> Khaled
> > > > > > >>>>
> > > > > > >>>> ----------------------------------------
> > > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > > >>>>> From: larsh@apache.org
> > > > > > >>>>> Subject: Re: HBase read performance
> > > > > > >>>>> To: user@hbase.apache.org
> > > > > > >>>>>
> > > > > > >>>>> Hi Khaled,
> > > > > > >>>>> is it the same with fewer threads? 1500 handler threads
> > seems to
> > > > > be a
> > > > > > >>>> lot. Typically a good number of threads depends on the
> > hardware
> > > > > > (number of
> > > > > > >>>> cores, number of spindles, etc). I cannot think of any type
> of
> > > > > > scenario
> > > > > > >>>> where more than 100 would give any improvement.
> > > > > > >>>>>
> > > > > > >>>>> How large is the payload per KV retrieved that way? If
> large
> > (as
> > > > > in a
> > > > > > >>>> few 100k) you definitely want to lower the number of the
> > handler
> > > > > > threads.
> > > > > > >>>>> How much heap do you give the region server? Does the
> > working set
> > > > > fit
> > > > > > >>>> into the cache? (i.e. in the metrics, do you see the
> eviction
> > count
> > > > > > going
> > > > > > >>>> up, if so it does not fit into the cache).
> > > > > > >>>>>
> > > > > > >>>>> If the working set does not fit into the cache (eviction
> > count goes
> > > > > > up)
> > > > > > >>>> then HBase will need to bring a new block in from disk on
> > each Get
> > > > > > >>>> (assuming the Gets are more or less random as far as the
> > server is
> > > > > > >>>> concerned).
> > > > > > >>>>> In that case you'll benefit from reducing the HFile block size
> > (from 64k
> > > > > > to
> > > > > > >>>> 8k or even 4k).
> > > > > > >>>>>
> > > > > > >>>>> Lastly I don't think we tested the performance of using
> > reverse
> > > > > scan
> > > > > > >>>> this way, there is probably room to optimize this.
> > > > > > >>>>> Can you restructure your keys to allow forwards scanning?
> For
> > > > > example
> > > > > > >>>> you could store the time as MAX_LONG-time. Or you could
> > invert all
> > > > > > the bits
> > > > > > >>>> of the time portion of the key, so that it sort the other
> > way. Then
> > > > > > you
> > > > > > >>>> could do a forward scan.
> > > > > > >>>>>
> > > > > > >>>>> Let us know how it goes.
> > > > > > >>>>>
> > > > > > >>>>> -- Lars
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> ----- Original Message -----
> > > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > >>>>> Cc:
> > > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > > >>>>> Subject: HBase read performance
> > > > > > >>>>>
> > > > > > >>>>> Hi,
> > > > > > >>>>>
> > > > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1),
> > where I
> > > > > have
> > > > > > a
> > > > > > >>>> client reading ~1000 keys from an HBase table. These keys
> > happen to
> > > > > > fall on
> > > > > > >>>> the same region server. For my reads I use reverse scan to
> > read each
> > > > > > key as
> > > > > > >>>> I want the key prior to a specific time stamp (time stamps
> are
> > > > > stored
> > > > > > in
> > > > > > >>>> reverse order). I don't believe gets can accomplish that,
> > right? so
> > > > > I
> > > > > > use
> > > > > > >>>> scan, with caching set to 1.
> > > > > > >>>>>
> > > > > > >>>>> I use 2000 reader threads in the client and on HBase, I've
> > set
> > > > > > >>>> hbase.regionserver.handler.count to 1500. With this setup,
> my
> > > > > scatter
> > > > > > >>>> gather is very slow and can take up to 10s in total. Timing
> an
> > > > > > individual
> > > > > > >>>> getScanner(..) call on the client side, it can easily take
> few
> > > > > > hundreds of
> > > > > > >>>> ms. I also got the following metrics from the region server
> in
> > > > > > question:
> > > > > > >>>>>
> > > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > > >>>>>
> > > > > > >>>>> "processCallTime_min" : 0,
> > > > > > >>>>> "processCallTime_max" : 12755,
> > > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > > >>>> : 89,
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > > >>>> : 11300,
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > > >>>> : 654.4949739797315,
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > > >>>> : 101.0,
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > > >>>> : 101.0,
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > > >>>> : 101.0,
> > > > > > >>>>>
> > > > > > >>>>
> > > > > >
> > > > >
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > > >>>> : 113.0,
> > > > > > >>>>>
> > > > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > > > >>>>>
> > > > > > >>>>> In addition to all this, i monitored the hardware resources
> > (CPU,
> > > > > > disk,
> > > > > > >>>> and network) of both the client and the region server and
> > nothing
> > > > > > seems
> > > > > > >>>> anywhere near saturation. So I am puzzled by what's going on
> > and
> > > > > > where this
> > > > > > >>>> time is going.
> > > > > > >>>>>
> > > > > > >>>>> Few things to note based on the above measurements: both
> > medians of
> > > > > > IPC
> > > > > > >>>> processCallTime and queueCallTime are basically zero (ms I
> > presume,
> > > > > > >>>> right?). However, scanNext_median is 101 (ms too, right?). I
> > am not
> > > > > > sure
> > > > > > >>>> how this adds up. Also, even though the 101 figure seems
> > > > > outrageously
> > > > > > high
> > > > > > >>>> and I don't know why, still all these scans should be
> > happening in
> > > > > > >>>> parallel, so the overall call should finish fast, given that
> > no
> > > > > > hardware
> > > > > > >>>> resource is contended, right? but this is not what's
> > happening, so I
> > > > > > have
> > > > > > >>>> to be missing something(s).
> > > > > > >>>>>
> > > > > > >>>>> So, any help is appreciated there.
> > > > > > >>>>>
> > > > > > >>>>> Thanks,
> > > > > > >>>>> Khaled
> > > > > > >>>>
> > > > > > >>>>
> > > > > >
> > > > >
> > >
> >
> >
>

Re: HBase read performance

Posted by Qiang Tian <ti...@gmail.com>.
When using a separate HConnection instance, both its RpcClient instance
(which maintains the connection to a regionserver) and its Registry instance
(which maintains the connection to ZooKeeper) will be separate.

see
http://shammijayasinghe.blogspot.com/2012/02/zookeeper-increase-maximum-number-of.html
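
A minimal sketch of that pooling idea (assuming the 0.98 client API; the pool
size is illustrative, and the ZooKeeper settings named in the comments are the
per-client connection limits the post above discusses):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;

public class ConnectionPool {
  private final HConnection[] pool;

  public ConnectionPool(int size) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    pool = new HConnection[size];
    for (int i = 0; i < size; i++) {
      // Each HConnection carries its own RpcClient and its own ZooKeeper session,
      // so the ZK server has to allow that many sessions from one client IP
      // (maxClientCnxns in zoo.cfg, or hbase.zookeeper.property.maxClientCnxns
      // when HBase manages the quorum).
      pool[i] = HConnectionManager.createConnection(conf);
    }
  }

  // Spread caller threads across connections instead of funneling every RPC
  // through a single shared HConnection.
  public HConnection get(int threadId) {
    return pool[threadId % pool.length];
  }

  public void close() throws Exception {
    for (HConnection c : pool) {
      c.close();
    }
  }
}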





On Sun, Oct 5, 2014 at 2:24 PM, Khaled Elmeleegy <kd...@hotmail.com> wrote:

> I tried creating my own HConnections pool to use for my HBase calls, so
> that not all the (2K) threads share the same HConnection. However, I could
> only have 10 HConnections. Beyond that I get ZK exceptions, please find it
> below. Also, with 10 HConnections, I don't see noticeable improvement in
> performance so far.
> 2014-10-05 06:11:26,490 WARN  [main] zookeeper.RecoverableZooKeeper
> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> ZooKeeper, quorum=54.68.206.252:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 06:11:26,490
> INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155))
> - Sleeping 1000ms before retry #0...2014-10-05 06:11:27,845 WARN  [main]
> zookeeper.RecoverableZooKeeper
> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> ZooKeeper, quorum=54.68.206.252:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 06:11:27,849
> INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155))
> - Sleeping 2000ms before retry #1...2014-10-05 06:11:30,405 WARN  [main]
> zookeeper.RecoverableZooKeeper
> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> ZooKeeper, quorum=54.68.206.252:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 06:11:30,405
> INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155))
> - Sleeping 4000ms before retry #2...2014-10-05 06:11:35,278 WARN  [main]
> zookeeper.RecoverableZooKeeper
> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> ZooKeeper, quorum=54.68.206.252:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 06:11:35,279
> INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155))
> - Sleeping 8000ms before retry #3...2014-10-05 06:11:44,393 WARN  [main]
> zookeeper.RecoverableZooKeeper
> (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient
> ZooKeeper, quorum=54.68.206.252:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid2014-10-05 06:11:44,393
> ERROR [main] zookeeper.RecoverableZooKeeper
> (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed
> after 4 attempts2014-10-05 06:11:44,394 WARN  [main] zookeeper.ZKUtil
> (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=
> 54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode
> (/hbase/hbaseid)org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/hbaseid at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)   at
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
>     at
> org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>       at
> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
>    at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
>  at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)        at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)      at
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
>     at
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)...
> > From: kdiaa@hotmail.com
> > To: user@hbase.apache.org
> > Subject: RE: HBase read performance
> > Date: Fri, 3 Oct 2014 12:37:39 -0700
> >
> > Lars, Ted, and Qiang,
> > Thanks for all the input.
> > Qiang: yes all the threads are in the same client process sharing the
> same connection. And since I don't see hardware contention, may be there is
> contention over this code path. I'll try using many connections and see if
> it alleviates the problems and I'll report back.
> > Thanks again,Khaled
> >
> > > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > > Subject: Re: HBase read performance
> > > From: tianq01@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Regarding to profiling, Andrew introduced
> > > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> months
> > > ago.
> > >
> > > processCallTime comes from RpcServer#call, so it looks good?
> > >
> > > I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306
> > >
> > > how many processes do you have for your 2000 threads?
> > > if olny 1 process, those threads will share just 1 connection to that
> > > regionserver, there might be big contention on the RPC code path.
> ---for
> > > such case, could you try using different connections?
> > >
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > >
> > >
> > >
> > >
> > > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Khaled:
> > > > Do you have profiler such as jprofiler ?
> > > > Profiling would give us more hint.
> > > >
> > > > Otherwise capturing stack trace during the period of reverse scan
> would
> > > > help.
> > > >
> > > > Cheers
> > > >
> > > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org>
> wrote:
> > > >
> > > > > You might have the data in the OS buffer cache, without short
> circuit
> > > > > reading the region server has to request the block from the data
> node
> > > > > process, which then reads it from the block cache.
> > > > > That is a few context switches per RPC that do not show up in CPU
> > > > metrics.
> > > > > In that you also would not see disk IO.
> > > > >
> > > > >
> > > > > If - as you say - you see a lot of evicted blocks the data *has*
> to come
> > > > > from the OS. If you do not see disk IO is *has* to come from the OS
> > > > cache.
> > > > > I.e. there's more RAM on your boxes, and you should increase the
> heap
> > > > block
> > > > > cache.
> > > > >
> > > > >
> > > > > You can measure the context switches with vmstat. Other than that
> I have
> > > > > no suggestion until I reproduce the problem.
> > > > > Also check the data locality index of the region server it should
> be
> > > > close
> > > > > to 100%.
> > > > >
> > > > >
> > > > > -- Lars
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > > Subject: RE: HBase read performance
> > > > >
> > > > >
> > > > > Lars thanks a lot about all the tips. I'll make sure I cover all
> of them
> > > > > and get back to you. I am not sure they are the bottleneck though
> as they
> > > > > all are about optimizing physical resource usage. As I said, I
> don't see
> > > > > any contended physical resources now. I'll also try to reproduce
> this
> > > > > problem in a simpler environment and pass to you the test program
> to play
> > > > > with.
> > > > >
> > > > >
> > > > > Couple of high level points to make. You are right that my use
> case is
> > > > > kind of a worst case for HBase reads. But, if things go the way you
> > > > > described them, there should be tons of disk IO and that should be
> > > > clearly
> > > > > the bottleneck. This is not the case though. That's for the simple
> reason
> > > > > that this is done in a test environment (I am still prototyping),
> and
> > > > not a
> > > > > lot of data is yet written to HBase. However for the real use
> case, there
> > > > > should writers constantly writing data to HBase and readers
> occasionally
> > > > > doing this scatter/gather. At steady state, things should only get
> worse
> > > > > and all the issues you mentioned should get far more pronounced.
> At this
> > > > > point, one can try to mitigate it using more memory or so. I am
> not there
> > > > > yet as I think I am hitting some software bottleneck, which I
> don't know
> > > > > how to work around.
> > > > >
> > > > > Khaled
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ----------------------------------------
> > > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > > From: larsh@apache.org
> > > > > > Subject: Re: HBase read performance
> > > > > > To: user@hbase.apache.org
> > > > > >
> > > > > > OK... We might need to investigate this.
> > > > > > Any chance that you can provide a minimal test program and
> instruction
> > > > > about how to set it up.
> > > > > > We can do some profiling then.
> > > > > >
> > > > > > One thing to note is that with scanning HBase cannot use bloom
> filters
> > > > > to rule out HFiles ahead of time, it needs to look into all of
> them.
> > > > > > So you kind of hit on the absolute worst case:
> > > > > > - random reads that do not fit into the block cache
> > > > > > - cannot use bloom filters
> > > > > >
> > > > > >
> > > > > > Few more question/comments:
> > > > > > - Do you have short circuit reading enabled? If not, you should.
> > > > > > - Is your table major compacted? That will reduce the number of
> files
> > > > to
> > > > > look at.
> > > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It
> disabled
> > > > > by default in HBase, but necessarily in your install of HDFS.
> > > > > > - Which version of HDFS are you using as backing filesystem?
> > > > > > - If your disk is idle, it means the data fits into the OS buffer
> > > > cache.
> > > > > In turn that means that you increase the heap for the region
> servers. You
> > > > > can also use block encoding (FAST_DIFF) to try to make sure the
> entire
> > > > > working set fits into the cache.
> > > > > >
> > > > > > - Also try to reduce the block size - although if your overall
> working
> > > > > set does not fit in the heap it won't help much.
> > > > > >
> > > > > >
> > > > > > This is a good section of the book to read through generally
> (even
> > > > > though you might know most of this already):
> > > > > http://hbase.apache.org/book.html#perf.configurations
> > > > > >
> > > > > >
> > > > > > -- Lars
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > > Cc:
> > > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > > Subject: RE: HBase read performance
> > > > > >
> > > > > > I do see a very brief spike in CPU (user/system), but it's no
> where
> > > > near
> > > > > 0% idle. It goes from 99% idle down to something like 40% idle for
> a
> > > > second
> > > > > or so. The thing to note, this is all on a test cluster, so no
> real load.
> > > > > Things are generally idle until i issue 2-3 of these
> multi-scan-requests
> > > > to
> > > > > render a web page. Then, you see the spike in the cpu and some
> activity
> > > > in
> > > > > the network and disk, but nowhere near saturation.
> > > > > >
> > > > > >
> > > > > > If there are specific tests you'd like me to do to debug this,
> I'd be
> > > > > more than happy to do it.
> > > > > >
> > > > > >
> > > > > > Khaled
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----------------------------------------
> > > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > > >> From: larsh@apache.org
> > > > > >> Subject: Re: HBase read performance
> > > > > >> To: user@hbase.apache.org
> > > > > >>
> > > > > >> I still think you're waiting on disk. No IOWAIT? So CPU is not
> waiting
> > > > > a lot for IO. No high User/System CPU either?
> > > > > >>
> > > > > >> If you see a lot of evicted block then each RPC has a high
> chance of
> > > > > requiring to bring an entire 64k block in. You'll see bad
> performance
> > > > with
> > > > > this.
> > > > > >>
> > > > > >> We might need to trace this specific scenario.
> > > > > >>
> > > > > >> -- Lars
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> ________________________________
> > > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > > >> Subject: RE: HBase read performance
> > > > > >>
> > > > > >>
> > > > > >> I've set the heap size to 6GB and I do have gc logging. No long
> pauses
> > > > > there -- occasional 0.1s or 0.2s.
> > > > > >>
> > > > > >> Other than the discrepancy between what's reported on the
> client and
> > > > > what's reported at the RS, there is also the issue of not getting
> proper
> > > > > concurrency. So, even if a reverse get takes 100ms or so (this has
> to be
> > > > > mostly blocking on various things as no physical resource is
> contended),
> > > > > then the other gets/scans should be able to proceed in parallel,
> so a
> > > > > thousand concurrent gets/scans should finish in few hundreds of ms
> not
> > > > many
> > > > > seconds. That's why I thought I'd increase the handlers count to
> try to
> > > > get
> > > > > more concurrency, but it didn't help. So, there must be something
> else.
> > > > > >>
> > > > > >> Khaled
> > > > > >>
> > > > > >> ----------------------------------------
> > > > > >>> From: ndimiduk@gmail.com
> > > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > > >>> Subject: Re: HBase read performance
> > > > > >>> To: user@hbase.apache.org
> > > > > >>>
> > > > > >>> Do check again on the heap size of the region servers. The
> default
> > > > > >>> unconfigured size is 1G; too small for much of anything. Check
> your
> > > > RS
> > > > > logs
> > > > > >>> -- look for lines produced by the JVMPauseMonitor thread. They
> > > > usually
> > > > > >>> correlate with long GC pauses or other process-freeze events.
> > > > > >>>
> > > > > >>> Get is implemented as a Scan of a single row, so a reverse
> scan of a
> > > > > single
> > > > > >>> row should be functionally equivalent.
> > > > > >>>
> > > > > >>> In practice, I have seen discrepancy between the latencies
> reported
> > > > by
> > > > > the
> > > > > >>> RS and the latencies experienced by the client. I've not
> investigated
> > > > > this
> > > > > >>> area thoroughly.
> > > > > >>>
> > > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <
> kdiaa@hotmail.com
> > > > >
> > > > > wrote:
> > > > > >>>
> > > > > >>>> Thanks Lars for your quick reply.
> > > > > >>>>
> > > > > >>>> Yes performance is similar with less handlers (I tried with
> 100
> > > > > first).
> > > > > >>>>
> > > > > >>>> The payload is not big ~1KB or so. The working set doesn't
> seem to
> > > > > fit in
> > > > > >>>> memory as there are many cache misses. However, disk is far
> from
> > > > > being a
> > > > > >>>> bottleneck. I checked using iostat. I also verified that
> neither the
> > > > > >>>> network nor the CPU of the region server or the client are a
> > > > > bottleneck.
> > > > > >>>> This leads me to believe that likely this is a software
> bottleneck,
> > > > > >>>> possibly due to a misconfiguration on my side. I just don't
> know how
> > > > > to
> > > > > >>>> debug it. A clear disconnect I see is the individual request
> latency
> > > > > as
> > > > > >>>> reported by metrics on the region server (IPC processCallTime
> vs
> > > > > scanNext)
> > > > > >>>> vs what's measured on the client. Does this sound right? Any
> ideas
> > > > on
> > > > > how
> > > > > >>>> to better debug it?
> > > > > >>>>
> > > > > >>>> About this trick with the timestamps to be able to do a
> forward
> > > > scan,
> > > > > >>>> thanks for pointing it out. Actually, I am aware of it. The
> problem
> > > > I
> > > > > have
> > > > > >>>> is, sometimes I want to get the key after a particular
> timestamp and
> > > > > >>>> sometimes I want to get the key before, so just relying on
> the key
> > > > > order
> > > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought
> reverse
> > > > scan
> > > > > can
> > > > > >>>> do the trick though.
> > > > > >>>>
> > > > > >>>> Khaled
> > > > > >>>>
> > > > > >>>> ----------------------------------------
> > > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > > >>>>> From: larsh@apache.org
> > > > > >>>>> Subject: Re: HBase read performance
> > > > > >>>>> To: user@hbase.apache.org
> > > > > >>>>>
> > > > > >>>>> Hi Khaled,
> > > > > >>>>> is it the same with fewer threads? 1500 handler threads
> seems to
> > > > be a
> > > > > >>>> lot. Typically a good number of threads depends on the
> hardware
> > > > > (number of
> > > > > >>>> cores, number of spindles, etc). I cannot think of any type of
> > > > > scenario
> > > > > >>>> where more than 100 would give any improvement.
> > > > > >>>>>
> > > > > >>>>> How large is the payload per KV retrieved that way? If large
> (as
> > > > in a
> > > > > >>>> few 100k) you definitely want to lower the number of the
> handler
> > > > > threads.
> > > > > >>>>> How much heap do you give the region server? Does the
> working set
> > > > fit
> > > > > >>>> into the cache? (i.e. in the metrics, do you see the eviction
> count
> > > > > going
> > > > > >>>> up, if so it does not fit into the cache).
> > > > > >>>>>
> > > > > >>>>> If the working set does not fit into the cache (eviction
> count goes
> > > > > up)
> > > > > >>>> then HBase will need to bring a new block in from disk on
> each Get
> > > > > >>>> (assuming the Gets are more or less random as far as the
> server is
> > > > > >>>> concerned).
> > > > > >>>>> In case you'll benefit from reducing the HFile block size
> (from 64k
> > > > > to
> > > > > >>>> 8k or even 4k).
> > > > > >>>>>
> > > > > >>>>> Lastly I don't think we tested the performance of using
> reverse
> > > > scan
> > > > > >>>> this way, there is probably room to optimize this.
> > > > > >>>>> Can you restructure your keys to allow forwards scanning? For
> > > > example
> > > > > >>>> you could store the time as MAX_LONG-time. Or you could
> invert all
> > > > > the bits
> > > > > >>>> of the time portion of the key, so that it sort the other
> way. Then
> > > > > you
> > > > > >>>> could do a forward scan.
> > > > > >>>>>
> > > > > >>>>> Let us know how it goes.
> > > > > >>>>>
> > > > > >>>>> -- Lars
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> ----- Original Message -----
> > > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > >>>>> Cc:
> > > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > > >>>>> Subject: HBase read performance
> > > > > >>>>>
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1),
> where I
> > > > have
> > > > > a
> > > > > >>>> client reading ~1000 keys from an HBase table. These keys
> happen to
> > > > > fall on
> > > > > >>>> the same region server. For my reads I use reverse scan to
> read each
> > > > > key as
> > > > > >>>> I want the key prior to a specific time stamp (time stamps are
> > > > stored
> > > > > in
> > > > > >>>> reverse order). I don't believe gets can accomplish that,
> right? so
> > > > I
> > > > > use
> > > > > >>>> scan, with caching set to 1.
> > > > > >>>>>
> > > > > >>>>> I use 2000 reader threads in the client and on HBase, I've
> set
> > > > > >>>> hbase.regionserver.handler.count to 1500. With this setup, my
> > > > scatter
> > > > > >>>> gather is very slow and can take up to 10s in total. Timing an
> > > > > individual
> > > > > >>>> getScanner(..) call on the client side, it can easily take few
> > > > > hundreds of
> > > > > >>>> ms. I also got the following metrics from the region server in
> > > > > question:
> > > > > >>>>>
> > > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > > >>>>> "queueCallTime_median" : 0.0,
> > > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > > >>>>>
> > > > > >>>>> "processCallTime_min" : 0,
> > > > > >>>>> "processCallTime_max" : 12755,
> > > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > > >>>>> "processCallTime_median" : 0.0,
> > > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > > >>>> : 89,
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > > >>>> : 11300,
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > > >>>> : 654.4949739797315,
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > > >>>> : 101.0,
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > > >>>> : 101.0,
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > > >>>> : 101.0,
> > > > > >>>>>
> > > > > >>>>
> > > > >
> > > >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > > >>>> : 113.0,
> > > > > >>>>>
> > > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > > >>>>>
> > > > > >>>>> In addition to all this, i monitored the hardware resources
> (CPU,
> > > > > disk,
> > > > > >>>> and network) of both the client and the region server and
> nothing
> > > > > seems
> > > > > >>>> anywhere near saturation. So I am puzzled by what's going on
> and
> > > > > where this
> > > > > >>>> time is going.
> > > > > >>>>>
> > > > > >>>>> Few things to note based on the above measurements: both
> medians of
> > > > > IPC
> > > > > >>>> processCallTime and queueCallTime are basically zero (ms I
> presume,
> > > > > >>>> right?). However, scanNext_median is 101 (ms too, right?). I
> am not
> > > > > sure
> > > > > >>>> how this adds up. Also, even though the 101 figure seems
> > > > outrageously
> > > > > high
> > > > > >>>> and I don't know why, still all these scans should be
> happening in
> > > > > >>>> parallel, so the overall call should finish fast, given that
> no
> > > > > hardware
> > > > > >>>> resource is contended, right? but this is not what's
> happening, so I
> > > > > have
> > > > > >>>> to be missing something(s).
> > > > > >>>>>
> > > > > >>>>> So, any help is appreciated there.
> > > > > >>>>>
> > > > > >>>>> Thanks,
> > > > > >>>>> Khaled
> > > > > >>>>
> > > > > >>>>
> > > > >
> > > >
> >
>
>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
I tried creating my own pool of HConnections for my HBase calls, so that the (2K) threads don't all share the same HConnection. However, I could only create 10 HConnections; beyond that I get ZooKeeper exceptions, please find one below. Also, with 10 HConnections I don't see a noticeable improvement in performance so far.
2014-10-05 06:11:26,490 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-10-05 06:11:26,490 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 1000ms before retry #0...
2014-10-05 06:11:27,845 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-10-05 06:11:27,849 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 2000ms before retry #1...
2014-10-05 06:11:30,405 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-10-05 06:11:30,405 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 4000ms before retry #2...
2014-10-05 06:11:35,278 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-10-05 06:11:35,279 INFO  [main] util.RetryCounter (RetryCounter.java:sleepUntilNextRetry(155)) - Sleeping 8000ms before retry #3...
2014-10-05 06:11:44,393 WARN  [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(253)) - Possibly transient ZooKeeper, quorum=54.68.206.252:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-10-05 06:11:44,393 ERROR [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:retryOrThrow(255)) - ZooKeeper exists failed after 4 attempts
2014-10-05 06:11:44,394 WARN  [main] zookeeper.ZKUtil (ZKUtil.java:checkExists(482)) - hconnection-0x4e174f3b, quorum=54.68.206.252:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
    at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:857)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:662)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:414)
    at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:335)
    ...
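
A minimal sketch of the kind of pooling described above, assuming the 0.98 client API (HConnectionManager.createConnection, HConnection.getTable, reversed Scan); the class and method names are illustrative, not from the actual test program. The ConnectionLoss errors once more than ~10 connections are created could point to a connection cap on the ZooKeeper quorum (for example a low maxClientCnxns), but that is only a guess, not something confirmed in this thread.

// Illustrative sketch only, assuming the HBase 0.98 client API.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class HConnectionPool {
  private final HConnection[] connections;
  private final AtomicInteger next = new AtomicInteger();

  public HConnectionPool(Configuration conf, int size) throws IOException {
    connections = new HConnection[size];
    for (int i = 0; i < size; i++) {
      // Each call creates an independent connection (its own ZK session and RPC sockets).
      connections[i] = HConnectionManager.createConnection(conf);
    }
  }

  // Round-robin across the pool so the reader threads don't all share one connection.
  private HConnection pick() {
    int i = (next.getAndIncrement() & Integer.MAX_VALUE) % connections.length;
    return connections[i];
  }

  // One "reverse get": scan backwards from startRow and return the first row found.
  public Result reverseGet(String tableName, byte[] startRow) throws IOException {
    HTableInterface table = pick().getTable(tableName);
    try {
      Scan scan = new Scan(startRow);
      scan.setReversed(true);
      scan.setCaching(1);
      ResultScanner scanner = table.getScanner(scan);
      try {
        return scanner.next();
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
  }

  public void close() throws IOException {
    for (HConnection c : connections) {
      c.close();
    }
  }
}

Each reader thread would then call reverseGet(...) against the shared pool; if a ZooKeeper-side cap is what limits the pool to ~10 connections, raising it on the quorum would be worth checking before growing the pool further.
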
> From: kdiaa@hotmail.com
> To: user@hbase.apache.org
> Subject: RE: HBase read performance
> Date: Fri, 3 Oct 2014 12:37:39 -0700
> 
> Lars, Ted, and Qiang,
> Thanks for all the input. 
> Qiang: yes all the threads are in the same client process sharing the same connection. And since I don't see hardware contention, may be there is contention over this code path. I'll try using many connections and see if it alleviates the problems and I'll report back.
> Thanks again,Khaled
> 
> > Date: Fri, 3 Oct 2014 15:18:30 +0800
> > Subject: Re: HBase read performance
> > From: tianq01@gmail.com
> > To: user@hbase.apache.org
> > 
> > Regarding to profiling, Andrew introduced
> > http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html months
> > ago.
> > 
> > processCallTime comes from RpcServer#call, so it looks good?
> > 
> > I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306
> > 
> > how many processes do you have for your 2000 threads?
> > if olny 1 process, those threads will share just 1 connection to that
> > regionserver, there might be big contention on the RPC code path. ---for
> > such case, could you try using different connections?
> > https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> > 
> > 
> > 
> > 
> > On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com> wrote:
> > 
> > > Khaled:
> > > Do you have profiler such as jprofiler ?
> > > Profiling would give us more hint.
> > >
> > > Otherwise capturing stack trace during the period of reverse scan would
> > > help.
> > >
> > > Cheers
> > >
> > > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:
> > >
> > > > You might have the data in the OS buffer cache, without short circuit
> > > > reading the region server has to request the block from the data node
> > > > process, which then reads it from the block cache.
> > > > That is a few context switches per RPC that do not show up in CPU
> > > metrics.
> > > > In that you also would not see disk IO.
> > > >
> > > >
> > > > If - as you say - you see a lot of evicted blocks the data *has* to come
> > > > from the OS. If you do not see disk IO is *has* to come from the OS
> > > cache.
> > > > I.e. there's more RAM on your boxes, and you should increase the heap
> > > block
> > > > cache.
> > > >
> > > >
> > > > You can measure the context switches with vmstat. Other than that I have
> > > > no suggestion until I reproduce the problem.
> > > > Also check the data locality index of the region server it should be
> > > close
> > > > to 100%.
> > > >
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > Sent: Thursday, October 2, 2014 3:24 PM
> > > > Subject: RE: HBase read performance
> > > >
> > > >
> > > > Lars thanks a lot about all the tips. I'll make sure I cover all of them
> > > > and get back to you. I am not sure they are the bottleneck though as they
> > > > all are about optimizing physical resource usage. As I said, I don't see
> > > > any contended physical resources now. I'll also try to reproduce this
> > > > problem in a simpler environment and pass to you the test program to play
> > > > with.
> > > >
> > > >
> > > > Couple of high level points to make. You are right that my use case is
> > > > kind of a worst case for HBase reads. But, if things go the way you
> > > > described them, there should be tons of disk IO and that should be
> > > clearly
> > > > the bottleneck. This is not the case though. That's for the simple reason
> > > > that this is done in a test environment (I am still prototyping), and
> > > not a
> > > > lot of data is yet written to HBase. However for the real use case, there
> > > > should writers constantly writing data to HBase and readers occasionally
> > > > doing this scatter/gather. At steady state, things should only get worse
> > > > and all the issues you mentioned should get far more pronounced. At this
> > > > point, one can try to mitigate it using more memory or so. I am not there
> > > > yet as I think I am hitting some software bottleneck, which I don't know
> > > > how to work around.
> > > >
> > > > Khaled
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ----------------------------------------
> > > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > > From: larsh@apache.org
> > > > > Subject: Re: HBase read performance
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > OK... We might need to investigate this.
> > > > > Any chance that you can provide a minimal test program and instruction
> > > > about how to set it up.
> > > > > We can do some profiling then.
> > > > >
> > > > > One thing to note is that with scanning HBase cannot use bloom filters
> > > > to rule out HFiles ahead of time, it needs to look into all of them.
> > > > > So you kind of hit on the absolute worst case:
> > > > > - random reads that do not fit into the block cache
> > > > > - cannot use bloom filters
> > > > >
> > > > >
> > > > > Few more question/comments:
> > > > > - Do you have short circuit reading enabled? If not, you should.
> > > > > - Is your table major compacted? That will reduce the number of files
> > > to
> > > > look at.
> > > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled
> > > > by default in HBase, but necessarily in your install of HDFS.
> > > > > - Which version of HDFS are you using as backing filesystem?
> > > > > - If your disk is idle, it means the data fits into the OS buffer
> > > cache.
> > > > In turn that means that you increase the heap for the region servers. You
> > > > can also use block encoding (FAST_DIFF) to try to make sure the entire
> > > > working set fits into the cache.
> > > > >
> > > > > - Also try to reduce the block size - although if your overall working
> > > > set does not fit in the heap it won't help much.
> > > > >
> > > > >
> > > > > This is a good section of the book to read through generally (even
> > > > though you might know most of this already):
> > > > http://hbase.apache.org/book.html#perf.configurations
> > > > >
> > > > >
> > > > > -- Lars
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > Cc:
> > > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > > Subject: RE: HBase read performance
> > > > >
> > > > > I do see a very brief spike in CPU (user/system), but it's no where
> > > near
> > > > 0% idle. It goes from 99% idle down to something like 40% idle for a
> > > second
> > > > or so. The thing to note, this is all on a test cluster, so no real load.
> > > > Things are generally idle until i issue 2-3 of these multi-scan-requests
> > > to
> > > > render a web page. Then, you see the spike in the cpu and some activity
> > > in
> > > > the network and disk, but nowhere near saturation.
> > > > >
> > > > >
> > > > > If there are specific tests you'd like me to do to debug this, I'd be
> > > > more than happy to do it.
> > > > >
> > > > >
> > > > > Khaled
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ----------------------------------------
> > > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > > >> From: larsh@apache.org
> > > > >> Subject: Re: HBase read performance
> > > > >> To: user@hbase.apache.org
> > > > >>
> > > > >> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting
> > > > a lot for IO. No high User/System CPU either?
> > > > >>
> > > > >> If you see a lot of evicted block then each RPC has a high chance of
> > > > requiring to bring an entire 64k block in. You'll see bad performance
> > > with
> > > > this.
> > > > >>
> > > > >> We might need to trace this specific scenario.
> > > > >>
> > > > >> -- Lars
> > > > >>
> > > > >>
> > > > >>
> > > > >> ________________________________
> > > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > > >> Subject: RE: HBase read performance
> > > > >>
> > > > >>
> > > > >> I've set the heap size to 6GB and I do have gc logging. No long pauses
> > > > there -- occasional 0.1s or 0.2s.
> > > > >>
> > > > >> Other than the discrepancy between what's reported on the client and
> > > > what's reported at the RS, there is also the issue of not getting proper
> > > > concurrency. So, even if a reverse get takes 100ms or so (this has to be
> > > > mostly blocking on various things as no physical resource is contended),
> > > > then the other gets/scans should be able to proceed in parallel, so a
> > > > thousand concurrent gets/scans should finish in few hundreds of ms not
> > > many
> > > > seconds. That's why I thought I'd increase the handlers count to try to
> > > get
> > > > more concurrency, but it didn't help. So, there must be something else.
> > > > >>
> > > > >> Khaled
> > > > >>
> > > > >> ----------------------------------------
> > > > >>> From: ndimiduk@gmail.com
> > > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > > >>> Subject: Re: HBase read performance
> > > > >>> To: user@hbase.apache.org
> > > > >>>
> > > > >>> Do check again on the heap size of the region servers. The default
> > > > >>> unconfigured size is 1G; too small for much of anything. Check your
> > > RS
> > > > logs
> > > > >>> -- look for lines produced by the JVMPauseMonitor thread. They
> > > usually
> > > > >>> correlate with long GC pauses or other process-freeze events.
> > > > >>>
> > > > >>> Get is implemented as a Scan of a single row, so a reverse scan of a
> > > > single
> > > > >>> row should be functionally equivalent.
> > > > >>>
> > > > >>> In practice, I have seen discrepancy between the latencies reported
> > > by
> > > > the
> > > > >>> RS and the latencies experienced by the client. I've not investigated
> > > > this
> > > > >>> area thoroughly.
> > > > >>>
> > > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> > > >
> > > > wrote:
> > > > >>>
> > > > >>>> Thanks Lars for your quick reply.
> > > > >>>>
> > > > >>>> Yes performance is similar with less handlers (I tried with 100
> > > > first).
> > > > >>>>
> > > > >>>> The payload is not big ~1KB or so. The working set doesn't seem to
> > > > fit in
> > > > >>>> memory as there are many cache misses. However, disk is far from
> > > > being a
> > > > >>>> bottleneck. I checked using iostat. I also verified that neither the
> > > > >>>> network nor the CPU of the region server or the client are a
> > > > bottleneck.
> > > > >>>> This leads me to believe that likely this is a software bottleneck,
> > > > >>>> possibly due to a misconfiguration on my side. I just don't know how
> > > > to
> > > > >>>> debug it. A clear disconnect I see is the individual request latency
> > > > as
> > > > >>>> reported by metrics on the region server (IPC processCallTime vs
> > > > scanNext)
> > > > >>>> vs what's measured on the client. Does this sound right? Any ideas
> > > on
> > > > how
> > > > >>>> to better debug it?
> > > > >>>>
> > > > >>>> About this trick with the timestamps to be able to do a forward
> > > scan,
> > > > >>>> thanks for pointing it out. Actually, I am aware of it. The problem
> > > I
> > > > have
> > > > >>>> is, sometimes I want to get the key after a particular timestamp and
> > > > >>>> sometimes I want to get the key before, so just relying on the key
> > > > order
> > > > >>>> doesn't work. Ideally, I want a reverse get(). I thought reverse
> > > scan
> > > > can
> > > > >>>> do the trick though.
> > > > >>>>
> > > > >>>> Khaled
> > > > >>>>
> > > > >>>> ----------------------------------------
> > > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > > >>>>> From: larsh@apache.org
> > > > >>>>> Subject: Re: HBase read performance
> > > > >>>>> To: user@hbase.apache.org
> > > > >>>>>
> > > > >>>>> Hi Khaled,
> > > > >>>>> is it the same with fewer threads? 1500 handler threads seems to
> > > be a
> > > > >>>> lot. Typically a good number of threads depends on the hardware
> > > > (number of
> > > > >>>> cores, number of spindles, etc). I cannot think of any type of
> > > > scenario
> > > > >>>> where more than 100 would give any improvement.
> > > > >>>>>
> > > > >>>>> How large is the payload per KV retrieved that way? If large (as
> > > in a
> > > > >>>> few 100k) you definitely want to lower the number of the handler
> > > > threads.
> > > > >>>>> How much heap do you give the region server? Does the working set
> > > fit
> > > > >>>> into the cache? (i.e. in the metrics, do you see the eviction count
> > > > going
> > > > >>>> up, if so it does not fit into the cache).
> > > > >>>>>
> > > > >>>>> If the working set does not fit into the cache (eviction count goes
> > > > up)
> > > > >>>> then HBase will need to bring a new block in from disk on each Get
> > > > >>>> (assuming the Gets are more or less random as far as the server is
> > > > >>>> concerned).
> > > > >>>>> In case you'll benefit from reducing the HFile block size (from 64k
> > > > to
> > > > >>>> 8k or even 4k).
> > > > >>>>>
> > > > >>>>> Lastly I don't think we tested the performance of using reverse
> > > scan
> > > > >>>> this way, there is probably room to optimize this.
> > > > >>>>> Can you restructure your keys to allow forwards scanning? For
> > > example
> > > > >>>> you could store the time as MAX_LONG-time. Or you could invert all
> > > > the bits
> > > > >>>> of the time portion of the key, so that it sort the other way. Then
> > > > you
> > > > >>>> could do a forward scan.
> > > > >>>>>
> > > > >>>>> Let us know how it goes.
> > > > >>>>>
> > > > >>>>> -- Lars
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> ----- Original Message -----
> > > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > >>>>> Cc:
> > > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > > >>>>> Subject: HBase read performance
> > > > >>>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I
> > > have
> > > > a
> > > > >>>> client reading ~1000 keys from an HBase table. These keys happen to
> > > > fall on
> > > > >>>> the same region server. For my reads I use reverse scan to read each
> > > > key as
> > > > >>>> I want the key prior to a specific time stamp (time stamps are
> > > stored
> > > > in
> > > > >>>> reverse order). I don't believe gets can accomplish that, right? so
> > > I
> > > > use
> > > > >>>> scan, with caching set to 1.
> > > > >>>>>
> > > > >>>>> I use 2000 reader threads in the client and on HBase, I've set
> > > > >>>> hbase.regionserver.handler.count to 1500. With this setup, my
> > > scatter
> > > > >>>> gather is very slow and can take up to 10s in total. Timing an
> > > > individual
> > > > >>>> getScanner(..) call on the client side, it can easily take few
> > > > hundreds of
> > > > >>>> ms. I also got the following metrics from the region server in
> > > > question:
> > > > >>>>>
> > > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > > >>>>> "queueCallTime_median" : 0.0,
> > > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > > >>>>>
> > > > >>>>> "processCallTime_min" : 0,
> > > > >>>>> "processCallTime_max" : 12755,
> > > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > > >>>>> "processCallTime_median" : 0.0,
> > > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > > >>>> : 89,
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > > >>>> : 11300,
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > > >>>> : 654.4949739797315,
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > > >>>> : 101.0,
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > > >>>> : 101.0,
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > > >>>> : 101.0,
> > > > >>>>>
> > > > >>>>
> > > >
> > > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > > >>>> : 113.0,
> > > > >>>>>
> > > > >>>>> Where "delta" is the name of the table I am querying.
> > > > >>>>>
> > > > >>>>> In addition to all this, i monitored the hardware resources (CPU,
> > > > disk,
> > > > >>>> and network) of both the client and the region server and nothing
> > > > seems
> > > > >>>> anywhere near saturation. So I am puzzled by what's going on and
> > > > where this
> > > > >>>> time is going.
> > > > >>>>>
> > > > >>>>> Few things to note based on the above measurements: both medians of
> > > > IPC
> > > > >>>> processCallTime and queueCallTime are basically zero (ms I presume,
> > > > >>>> right?). However, scanNext_median is 101 (ms too, right?). I am not
> > > > sure
> > > > >>>> how this adds up. Also, even though the 101 figure seems
> > > outrageously
> > > > high
> > > > >>>> and I don't know why, still all these scans should be happening in
> > > > >>>> parallel, so the overall call should finish fast, given that no
> > > > hardware
> > > > >>>> resource is contended, right? but this is not what's happening, so I
> > > > have
> > > > >>>> to be missing something(s).
> > > > >>>>>
> > > > >>>>> So, any help is appreciated there.
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Khaled
> > > > >>>>
> > > > >>>>
> > > >
> > >

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Lars, Ted, and Qiang,
Thanks for all the input. 
Qiang: yes, all the threads are in the same client process sharing the same connection. And since I don't see hardware contention, maybe there is contention over this code path. I'll try using many connections to see if that alleviates the problem, and I'll report back.
Thanks again,
Khaled

> Date: Fri, 3 Oct 2014 15:18:30 +0800
> Subject: Re: HBase read performance
> From: tianq01@gmail.com
> To: user@hbase.apache.org
> 
> Regarding to profiling, Andrew introduced
> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html months
> ago.
> 
> processCallTime comes from RpcServer#call, so it looks good?
> 
> I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306
> 
> how many processes do you have for your 2000 threads?
> if olny 1 process, those threads will share just 1 connection to that
> regionserver, there might be big contention on the RPC code path. ---for
> such case, could you try using different connections?
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
> 
> 
> 
> 
> On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com> wrote:
> 
> > Khaled:
> > Do you have profiler such as jprofiler ?
> > Profiling would give us more hint.
> >
> > Otherwise capturing stack trace during the period of reverse scan would
> > help.
> >
> > Cheers
> >
> > On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > You might have the data in the OS buffer cache, without short circuit
> > > reading the region server has to request the block from the data node
> > > process, which then reads it from the block cache.
> > > That is a few context switches per RPC that do not show up in CPU
> > metrics.
> > > In that you also would not see disk IO.
> > >
> > >
> > > If - as you say - you see a lot of evicted blocks the data *has* to come
> > > from the OS. If you do not see disk IO is *has* to come from the OS
> > cache.
> > > I.e. there's more RAM on your boxes, and you should increase the heap
> > block
> > > cache.
> > >
> > >
> > > You can measure the context switches with vmstat. Other than that I have
> > > no suggestion until I reproduce the problem.
> > > Also check the data locality index of the region server it should be
> > close
> > > to 100%.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > Sent: Thursday, October 2, 2014 3:24 PM
> > > Subject: RE: HBase read performance
> > >
> > >
> > > Lars thanks a lot about all the tips. I'll make sure I cover all of them
> > > and get back to you. I am not sure they are the bottleneck though as they
> > > all are about optimizing physical resource usage. As I said, I don't see
> > > any contended physical resources now. I'll also try to reproduce this
> > > problem in a simpler environment and pass to you the test program to play
> > > with.
> > >
> > >
> > > Couple of high level points to make. You are right that my use case is
> > > kind of a worst case for HBase reads. But, if things go the way you
> > > described them, there should be tons of disk IO and that should be
> > clearly
> > > the bottleneck. This is not the case though. That's for the simple reason
> > > that this is done in a test environment (I am still prototyping), and
> > not a
> > > lot of data is yet written to HBase. However for the real use case, there
> > > should writers constantly writing data to HBase and readers occasionally
> > > doing this scatter/gather. At steady state, things should only get worse
> > > and all the issues you mentioned should get far more pronounced. At this
> > > point, one can try to mitigate it using more memory or so. I am not there
> > > yet as I think I am hitting some software bottleneck, which I don't know
> > > how to work around.
> > >
> > > Khaled
> > >
> > >
> > >
> > >
> > >
> > >
> > > ----------------------------------------
> > > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > > From: larsh@apache.org
> > > > Subject: Re: HBase read performance
> > > > To: user@hbase.apache.org
> > > >
> > > > OK... We might need to investigate this.
> > > > Any chance that you can provide a minimal test program and instruction
> > > about how to set it up.
> > > > We can do some profiling then.
> > > >
> > > > One thing to note is that with scanning HBase cannot use bloom filters
> > > to rule out HFiles ahead of time, it needs to look into all of them.
> > > > So you kind of hit on the absolute worst case:
> > > > - random reads that do not fit into the block cache
> > > > - cannot use bloom filters
> > > >
> > > >
> > > > Few more question/comments:
> > > > - Do you have short circuit reading enabled? If not, you should.
> > > > - Is your table major compacted? That will reduce the number of files
> > to
> > > look at.
> > > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled
> > > by default in HBase, but necessarily in your install of HDFS.
> > > > - Which version of HDFS are you using as backing filesystem?
> > > > - If your disk is idle, it means the data fits into the OS buffer
> > cache.
> > > In turn that means that you increase the heap for the region servers. You
> > > can also use block encoding (FAST_DIFF) to try to make sure the entire
> > > working set fits into the cache.
> > > >
> > > > - Also try to reduce the block size - although if your overall working
> > > set does not fit in the heap it won't help much.
> > > >
> > > >
> > > > This is a good section of the book to read through generally (even
> > > though you might know most of this already):
> > > http://hbase.apache.org/book.html#perf.configurations
> > > >
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > Cc:
> > > > Sent: Thursday, October 2, 2014 11:27 AM
> > > > Subject: RE: HBase read performance
> > > >
> > > > I do see a very brief spike in CPU (user/system), but it's no where
> > near
> > > 0% idle. It goes from 99% idle down to something like 40% idle for a
> > second
> > > or so. The thing to note, this is all on a test cluster, so no real load.
> > > Things are generally idle until i issue 2-3 of these multi-scan-requests
> > to
> > > render a web page. Then, you see the spike in the cpu and some activity
> > in
> > > the network and disk, but nowhere near saturation.
> > > >
> > > >
> > > > If there are specific tests you'd like me to do to debug this, I'd be
> > > more than happy to do it.
> > > >
> > > >
> > > > Khaled
> > > >
> > > >
> > > >
> > > >
> > > > ----------------------------------------
> > > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > > >> From: larsh@apache.org
> > > >> Subject: Re: HBase read performance
> > > >> To: user@hbase.apache.org
> > > >>
> > > >> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting
> > > a lot for IO. No high User/System CPU either?
> > > >>
> > > >> If you see a lot of evicted block then each RPC has a high chance of
> > > requiring to bring an entire 64k block in. You'll see bad performance
> > with
> > > this.
> > > >>
> > > >> We might need to trace this specific scenario.
> > > >>
> > > >> -- Lars
> > > >>
> > > >>
> > > >>
> > > >> ________________________________
> > > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > >> Sent: Thursday, October 2, 2014 10:46 AM
> > > >> Subject: RE: HBase read performance
> > > >>
> > > >>
> > > >> I've set the heap size to 6GB and I do have gc logging. No long pauses
> > > there -- occasional 0.1s or 0.2s.
> > > >>
> > > >> Other than the discrepancy between what's reported on the client and
> > > what's reported at the RS, there is also the issue of not getting proper
> > > concurrency. So, even if a reverse get takes 100ms or so (this has to be
> > > mostly blocking on various things as no physical resource is contended),
> > > then the other gets/scans should be able to proceed in parallel, so a
> > > thousand concurrent gets/scans should finish in few hundreds of ms not
> > many
> > > seconds. That's why I thought I'd increase the handlers count to try to
> > get
> > > more concurrency, but it didn't help. So, there must be something else.
> > > >>
> > > >> Khaled
> > > >>
> > > >> ----------------------------------------
> > > >>> From: ndimiduk@gmail.com
> > > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > > >>> Subject: Re: HBase read performance
> > > >>> To: user@hbase.apache.org
> > > >>>
> > > >>> Do check again on the heap size of the region servers. The default
> > > >>> unconfigured size is 1G; too small for much of anything. Check your
> > RS
> > > logs
> > > >>> -- look for lines produced by the JVMPauseMonitor thread. They
> > usually
> > > >>> correlate with long GC pauses or other process-freeze events.
> > > >>>
> > > >>> Get is implemented as a Scan of a single row, so a reverse scan of a
> > > single
> > > >>> row should be functionally equivalent.
> > > >>>
> > > >>> In practice, I have seen discrepancy between the latencies reported
> > by
> > > the
> > > >>> RS and the latencies experienced by the client. I've not investigated
> > > this
> > > >>> area thoroughly.
> > > >>>
> > > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> > >
> > > wrote:
> > > >>>
> > > >>>> Thanks Lars for your quick reply.
> > > >>>>
> > > >>>> Yes performance is similar with less handlers (I tried with 100
> > > first).
> > > >>>>
> > > >>>> The payload is not big ~1KB or so. The working set doesn't seem to
> > > fit in
> > > >>>> memory as there are many cache misses. However, disk is far from
> > > being a
> > > >>>> bottleneck. I checked using iostat. I also verified that neither the
> > > >>>> network nor the CPU of the region server or the client are a
> > > bottleneck.
> > > >>>> This leads me to believe that likely this is a software bottleneck,
> > > >>>> possibly due to a misconfiguration on my side. I just don't know how
> > > to
> > > >>>> debug it. A clear disconnect I see is the individual request latency
> > > as
> > > >>>> reported by metrics on the region server (IPC processCallTime vs
> > > scanNext)
> > > >>>> vs what's measured on the client. Does this sound right? Any ideas
> > on
> > > how
> > > >>>> to better debug it?
> > > >>>>
> > > >>>> About this trick with the timestamps to be able to do a forward
> > scan,
> > > >>>> thanks for pointing it out. Actually, I am aware of it. The problem
> > I
> > > have
> > > >>>> is, sometimes I want to get the key after a particular timestamp and
> > > >>>> sometimes I want to get the key before, so just relying on the key
> > > order
> > > >>>> doesn't work. Ideally, I want a reverse get(). I thought reverse
> > scan
> > > can
> > > >>>> do the trick though.
> > > >>>>
> > > >>>> Khaled
> > > >>>>
> > > >>>> ----------------------------------------
> > > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > > >>>>> From: larsh@apache.org
> > > >>>>> Subject: Re: HBase read performance
> > > >>>>> To: user@hbase.apache.org
> > > >>>>>
> > > >>>>> Hi Khaled,
> > > >>>>> is it the same with fewer threads? 1500 handler threads seems to
> > be a
> > > >>>> lot. Typically a good number of threads depends on the hardware
> > > (number of
> > > >>>> cores, number of spindles, etc). I cannot think of any type of
> > > scenario
> > > >>>> where more than 100 would give any improvement.
> > > >>>>>
> > > >>>>> How large is the payload per KV retrieved that way? If large (as
> > in a
> > > >>>> few 100k) you definitely want to lower the number of the handler
> > > threads.
> > > >>>>> How much heap do you give the region server? Does the working set
> > fit
> > > >>>> into the cache? (i.e. in the metrics, do you see the eviction count
> > > going
> > > >>>> up, if so it does not fit into the cache).
> > > >>>>>
> > > >>>>> If the working set does not fit into the cache (eviction count goes
> > > up)
> > > >>>> then HBase will need to bring a new block in from disk on each Get
> > > >>>> (assuming the Gets are more or less random as far as the server is
> > > >>>> concerned).
> > > >>>>> In case you'll benefit from reducing the HFile block size (from 64k
> > > to
> > > >>>> 8k or even 4k).
> > > >>>>>
> > > >>>>> Lastly I don't think we tested the performance of using reverse
> > scan
> > > >>>> this way, there is probably room to optimize this.
> > > >>>>> Can you restructure your keys to allow forwards scanning? For
> > example
> > > >>>> you could store the time as MAX_LONG-time. Or you could invert all
> > > the bits
> > > >>>> of the time portion of the key, so that it sort the other way. Then
> > > you
> > > >>>> could do a forward scan.
> > > >>>>>
> > > >>>>> Let us know how it goes.
> > > >>>>>
> > > >>>>> -- Lars
> > > >>>>>
> > > >>>>>
> > > >>>>> ----- Original Message -----
> > > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > >>>>> Cc:
> > > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > > >>>>> Subject: HBase read performance
> > > >>>>>
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I
> > have
> > > a
> > > >>>> client reading ~1000 keys from an HBase table. These keys happen to
> > > fall on
> > > >>>> the same region server. For my reads I use reverse scan to read each
> > > key as
> > > >>>> I want the key prior to a specific time stamp (time stamps are
> > stored
> > > in
> > > >>>> reverse order). I don't believe gets can accomplish that, right? so
> > I
> > > use
> > > >>>> scan, with caching set to 1.
> > > >>>>>
> > > >>>>> I use 2000 reader threads in the client and on HBase, I've set
> > > >>>> hbase.regionserver.handler.count to 1500. With this setup, my
> > scatter
> > > >>>> gather is very slow and can take up to 10s in total. Timing an
> > > individual
> > > >>>> getScanner(..) call on the client side, it can easily take few
> > > hundreds of
> > > >>>> ms. I also got the following metrics from the region server in
> > > question:
> > > >>>>>
> > > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > > >>>>> "queueCallTime_median" : 0.0,
> > > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > > >>>>>
> > > >>>>> "processCallTime_min" : 0,
> > > >>>>> "processCallTime_max" : 12755,
> > > >>>>> "processCallTime_mean" : 105.64873440912682,
> > > >>>>> "processCallTime_median" : 0.0,
> > > >>>>> "processCallTime_75th_percentile" : 2.0,
> > > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > > >>>>>
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > > >>>> : 89,
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > > >>>> : 11300,
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > > >>>> : 654.4949739797315,
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > > >>>> : 101.0,
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > > >>>> : 101.0,
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > > >>>> : 101.0,
> > > >>>>>
> > > >>>>
> > >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > > >>>> : 113.0,
> > > >>>>>
> > > >>>>> Where "delta" is the name of the table I am querying.
> > > >>>>>
> > > >>>>> In addition to all this, i monitored the hardware resources (CPU,
> > > disk,
> > > >>>> and network) of both the client and the region server and nothing
> > > seems
> > > >>>> anywhere near saturation. So I am puzzled by what's going on and
> > > where this
> > > >>>> time is going.
> > > >>>>>
> > > >>>>> Few things to note based on the above measurements: both medians of
> > > IPC
> > > >>>> processCallTime and queueCallTime are basically zero (ms I presume,
> > > >>>> right?). However, scanNext_median is 101 (ms too, right?). I am not
> > > sure
> > > >>>> how this adds up. Also, even though the 101 figure seems
> > outrageously
> > > high
> > > >>>> and I don't know why, still all these scans should be happening in
> > > >>>> parallel, so the overall call should finish fast, given that no
> > > hardware
> > > >>>> resource is contended, right? but this is not what's happening, so I
> > > have
> > > >>>> to be missing something(s).
> > > >>>>>
> > > >>>>> So, any help is appreciated there.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Khaled
> > > >>>>
> > > >>>>
> > >
> >
 		 	   		  

Re: HBase read performance

Posted by Qiang Tian <ti...@gmail.com>.
Regarding profiling, Andrew introduced
http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html a few months
ago.

processCallTime comes from RpcServer#call, so it looks good?

I have a suspect: https://issues.apache.org/jira/browse/HBASE-11306

How many processes do you have for your 2000 threads?
If only 1 process, those threads will share just 1 connection to that
region server, and there might be big contention on the RPC code path. For
that case, could you try using different connections?
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html
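In case it is useful, here is a minimal, untested sketch (0.98-era client API; the pool size is arbitrary and "delta" is just the table name from this thread) of giving each group of worker threads its own connection instead of funneling everything through the shared implicit one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class PerPoolConnections {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    int pools = 4;  // placeholder: number of independent connections to spread the threads over
    HConnection[] connections = new HConnection[pools];
    for (int i = 0; i < pools; i++) {
      // Each createConnection() owns its own RPC path to the region server,
      // so 2000 client threads are no longer multiplexed over a single connection.
      connections[i] = HConnectionManager.createConnection(conf);
    }
    // A worker thread picks "its" connection and gets a lightweight table handle from it.
    HTableInterface table = connections[0].getTable("delta");
    Scan scan = new Scan();
    scan.setReversed(true);
    scan.setCaching(1);
    ResultScanner scanner = table.getScanner(scan);
    Result r = scanner.next();
    scanner.close();
    table.close();
    for (HConnection c : connections) {
      c.close();
    }
  }
}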




On Fri, Oct 3, 2014 at 9:55 AM, Ted Yu <yu...@gmail.com> wrote:

> Khaled:
> Do you have profiler such as jprofiler ?
> Profiling would give us more hint.
>
> Otherwise capturing stack trace during the period of reverse scan would
> help.
>
> Cheers
>
> On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:
>
> > You might have the data in the OS buffer cache, without short circuit
> > reading the region server has to request the block from the data node
> > process, which then reads it from the block cache.
> > That is a few context switches per RPC that do not show up in CPU
> metrics.
> > In that you also would not see disk IO.
> >
> >
> > If - as you say - you see a lot of evicted blocks the data *has* to come
> > from the OS. If you do not see disk IO is *has* to come from the OS
> cache.
> > I.e. there's more RAM on your boxes, and you should increase the heap
> block
> > cache.
> >
> >
> > You can measure the context switches with vmstat. Other than that I have
> > no suggestion until I reproduce the problem.
> > Also check the data locality index of the region server it should be
> close
> > to 100%.
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Khaled Elmeleegy <kd...@hotmail.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Thursday, October 2, 2014 3:24 PM
> > Subject: RE: HBase read performance
> >
> >
> > Lars thanks a lot about all the tips. I'll make sure I cover all of them
> > and get back to you. I am not sure they are the bottleneck though as they
> > all are about optimizing physical resource usage. As I said, I don't see
> > any contended physical resources now. I'll also try to reproduce this
> > problem in a simpler environment and pass to you the test program to play
> > with.
> >
> >
> > Couple of high level points to make. You are right that my use case is
> > kind of a worst case for HBase reads. But, if things go the way you
> > described them, there should be tons of disk IO and that should be
> clearly
> > the bottleneck. This is not the case though. That's for the simple reason
> > that this is done in a test environment (I am still prototyping), and
> not a
> > lot of data is yet written to HBase. However for the real use case, there
> > should writers constantly writing data to HBase and readers occasionally
> > doing this scatter/gather. At steady state, things should only get worse
> > and all the issues you mentioned should get far more pronounced. At this
> > point, one can try to mitigate it using more memory or so. I am not there
> > yet as I think I am hitting some software bottleneck, which I don't know
> > how to work around.
> >
> > Khaled
> >
> >
> >
> >
> >
> >
> > ----------------------------------------
> > > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > > From: larsh@apache.org
> > > Subject: Re: HBase read performance
> > > To: user@hbase.apache.org
> > >
> > > OK... We might need to investigate this.
> > > Any chance that you can provide a minimal test program and instruction
> > about how to set it up.
> > > We can do some profiling then.
> > >
> > > One thing to note is that with scanning HBase cannot use bloom filters
> > to rule out HFiles ahead of time, it needs to look into all of them.
> > > So you kind of hit on the absolute worst case:
> > > - random reads that do not fit into the block cache
> > > - cannot use bloom filters
> > >
> > >
> > > Few more question/comments:
> > > - Do you have short circuit reading enabled? If not, you should.
> > > - Is your table major compacted? That will reduce the number of files
> to
> > look at.
> > > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled
> > by default in HBase, but necessarily in your install of HDFS.
> > > - Which version of HDFS are you using as backing filesystem?
> > > - If your disk is idle, it means the data fits into the OS buffer
> cache.
> > In turn that means that you increase the heap for the region servers. You
> > can also use block encoding (FAST_DIFF) to try to make sure the entire
> > working set fits into the cache.
> > >
> > > - Also try to reduce the block size - although if your overall working
> > set does not fit in the heap it won't help much.
> > >
> > >
> > > This is a good section of the book to read through generally (even
> > though you might know most of this already):
> > http://hbase.apache.org/book.html#perf.configurations
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: Khaled Elmeleegy <kd...@hotmail.com>
> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > Cc:
> > > Sent: Thursday, October 2, 2014 11:27 AM
> > > Subject: RE: HBase read performance
> > >
> > > I do see a very brief spike in CPU (user/system), but it's no where
> near
> > 0% idle. It goes from 99% idle down to something like 40% idle for a
> second
> > or so. The thing to note, this is all on a test cluster, so no real load.
> > Things are generally idle until i issue 2-3 of these multi-scan-requests
> to
> > render a web page. Then, you see the spike in the cpu and some activity
> in
> > the network and disk, but nowhere near saturation.
> > >
> > >
> > > If there are specific tests you'd like me to do to debug this, I'd be
> > more than happy to do it.
> > >
> > >
> > > Khaled
> > >
> > >
> > >
> > >
> > > ----------------------------------------
> > >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> > >> From: larsh@apache.org
> > >> Subject: Re: HBase read performance
> > >> To: user@hbase.apache.org
> > >>
> > >> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting
> > a lot for IO. No high User/System CPU either?
> > >>
> > >> If you see a lot of evicted block then each RPC has a high chance of
> > requiring to bring an entire 64k block in. You'll see bad performance
> with
> > this.
> > >>
> > >> We might need to trace this specific scenario.
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
> > >> ________________________________
> > >> From: Khaled Elmeleegy <kd...@hotmail.com>
> > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > >> Sent: Thursday, October 2, 2014 10:46 AM
> > >> Subject: RE: HBase read performance
> > >>
> > >>
> > >> I've set the heap size to 6GB and I do have gc logging. No long pauses
> > there -- occasional 0.1s or 0.2s.
> > >>
> > >> Other than the discrepancy between what's reported on the client and
> > what's reported at the RS, there is also the issue of not getting proper
> > concurrency. So, even if a reverse get takes 100ms or so (this has to be
> > mostly blocking on various things as no physical resource is contended),
> > then the other gets/scans should be able to proceed in parallel, so a
> > thousand concurrent gets/scans should finish in few hundreds of ms not
> many
> > seconds. That's why I thought I'd increase the handlers count to try to
> get
> > more concurrency, but it didn't help. So, there must be something else.
> > >>
> > >> Khaled
> > >>
> > >> ----------------------------------------
> > >>> From: ndimiduk@gmail.com
> > >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> > >>> Subject: Re: HBase read performance
> > >>> To: user@hbase.apache.org
> > >>>
> > >>> Do check again on the heap size of the region servers. The default
> > >>> unconfigured size is 1G; too small for much of anything. Check your
> RS
> > logs
> > >>> -- look for lines produced by the JVMPauseMonitor thread. They
> usually
> > >>> correlate with long GC pauses or other process-freeze events.
> > >>>
> > >>> Get is implemented as a Scan of a single row, so a reverse scan of a
> > single
> > >>> row should be functionally equivalent.
> > >>>
> > >>> In practice, I have seen discrepancy between the latencies reported
> by
> > the
> > >>> RS and the latencies experienced by the client. I've not investigated
> > this
> > >>> area thoroughly.
> > >>>
> > >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kdiaa@hotmail.com
> >
> > wrote:
> > >>>
> > >>>> Thanks Lars for your quick reply.
> > >>>>
> > >>>> Yes performance is similar with less handlers (I tried with 100
> > first).
> > >>>>
> > >>>> The payload is not big ~1KB or so. The working set doesn't seem to
> > fit in
> > >>>> memory as there are many cache misses. However, disk is far from
> > being a
> > >>>> bottleneck. I checked using iostat. I also verified that neither the
> > >>>> network nor the CPU of the region server or the client are a
> > bottleneck.
> > >>>> This leads me to believe that likely this is a software bottleneck,
> > >>>> possibly due to a misconfiguration on my side. I just don't know how
> > to
> > >>>> debug it. A clear disconnect I see is the individual request latency
> > as
> > >>>> reported by metrics on the region server (IPC processCallTime vs
> > scanNext)
> > >>>> vs what's measured on the client. Does this sound right? Any ideas
> on
> > how
> > >>>> to better debug it?
> > >>>>
> > >>>> About this trick with the timestamps to be able to do a forward
> scan,
> > >>>> thanks for pointing it out. Actually, I am aware of it. The problem
> I
> > have
> > >>>> is, sometimes I want to get the key after a particular timestamp and
> > >>>> sometimes I want to get the key before, so just relying on the key
> > order
> > >>>> doesn't work. Ideally, I want a reverse get(). I thought reverse
> scan
> > can
> > >>>> do the trick though.
> > >>>>
> > >>>> Khaled
> > >>>>
> > >>>> ----------------------------------------
> > >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> > >>>>> From: larsh@apache.org
> > >>>>> Subject: Re: HBase read performance
> > >>>>> To: user@hbase.apache.org
> > >>>>>
> > >>>>> Hi Khaled,
> > >>>>> is it the same with fewer threads? 1500 handler threads seems to
> be a
> > >>>> lot. Typically a good number of threads depends on the hardware
> > (number of
> > >>>> cores, number of spindles, etc). I cannot think of any type of
> > scenario
> > >>>> where more than 100 would give any improvement.
> > >>>>>
> > >>>>> How large is the payload per KV retrieved that way? If large (as
> in a
> > >>>> few 100k) you definitely want to lower the number of the handler
> > threads.
> > >>>>> How much heap do you give the region server? Does the working set
> fit
> > >>>> into the cache? (i.e. in the metrics, do you see the eviction count
> > going
> > >>>> up, if so it does not fit into the cache).
> > >>>>>
> > >>>>> If the working set does not fit into the cache (eviction count goes
> > up)
> > >>>> then HBase will need to bring a new block in from disk on each Get
> > >>>> (assuming the Gets are more or less random as far as the server is
> > >>>> concerned).
> > >>>>> In case you'll benefit from reducing the HFile block size (from 64k
> > to
> > >>>> 8k or even 4k).
> > >>>>>
> > >>>>> Lastly I don't think we tested the performance of using reverse
> scan
> > >>>> this way, there is probably room to optimize this.
> > >>>>> Can you restructure your keys to allow forwards scanning? For
> example
> > >>>> you could store the time as MAX_LONG-time. Or you could invert all
> > the bits
> > >>>> of the time portion of the key, so that it sort the other way. Then
> > you
> > >>>> could do a forward scan.
> > >>>>>
> > >>>>> Let us know how it goes.
> > >>>>>
> > >>>>> -- Lars
> > >>>>>
> > >>>>>
> > >>>>> ----- Original Message -----
> > >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> > >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > >>>>> Cc:
> > >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> > >>>>> Subject: HBase read performance
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I
> have
> > a
> > >>>> client reading ~1000 keys from an HBase table. These keys happen to
> > fall on
> > >>>> the same region server. For my reads I use reverse scan to read each
> > key as
> > >>>> I want the key prior to a specific time stamp (time stamps are
> stored
> > in
> > >>>> reverse order). I don't believe gets can accomplish that, right? so
> I
> > use
> > >>>> scan, with caching set to 1.
> > >>>>>
> > >>>>> I use 2000 reader threads in the client and on HBase, I've set
> > >>>> hbase.regionserver.handler.count to 1500. With this setup, my
> scatter
> > >>>> gather is very slow and can take up to 10s in total. Timing an
> > individual
> > >>>> getScanner(..) call on the client side, it can easily take few
> > hundreds of
> > >>>> ms. I also got the following metrics from the region server in
> > question:
> > >>>>>
> > >>>>> "queueCallTime_mean" : 2.190855525775637,
> > >>>>> "queueCallTime_median" : 0.0,
> > >>>>> "queueCallTime_75th_percentile" : 0.0,
> > >>>>> "queueCallTime_95th_percentile" : 1.0,
> > >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> > >>>>>
> > >>>>> "processCallTime_min" : 0,
> > >>>>> "processCallTime_max" : 12755,
> > >>>>> "processCallTime_mean" : 105.64873440912682,
> > >>>>> "processCallTime_median" : 0.0,
> > >>>>> "processCallTime_75th_percentile" : 2.0,
> > >>>>> "processCallTime_95th_percentile" : 7917.95,
> > >>>>> "processCallTime_99th_percentile" : 8876.89,
> > >>>>>
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> > >>>> : 89,
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> > >>>> : 11300,
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> > >>>> : 654.4949739797315,
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> > >>>> : 101.0,
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> > >>>> : 101.0,
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> > >>>> : 101.0,
> > >>>>>
> > >>>>
> >
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> > >>>> : 113.0,
> > >>>>>
> > >>>>> Where "delta" is the name of the table I am querying.
> > >>>>>
> > >>>>> In addition to all this, i monitored the hardware resources (CPU,
> > disk,
> > >>>> and network) of both the client and the region server and nothing
> > seems
> > >>>> anywhere near saturation. So I am puzzled by what's going on and
> > where this
> > >>>> time is going.
> > >>>>>
> > >>>>> Few things to note based on the above measurements: both medians of
> > IPC
> > >>>> processCallTime and queueCallTime are basically zero (ms I presume,
> > >>>> right?). However, scanNext_median is 101 (ms too, right?). I am not
> > sure
> > >>>> how this adds up. Also, even though the 101 figure seems
> outrageously
> > high
> > >>>> and I don't know why, still all these scans should be happening in
> > >>>> parallel, so the overall call should finish fast, given that no
> > hardware
> > >>>> resource is contended, right? but this is not what's happening, so I
> > have
> > >>>> to be missing something(s).
> > >>>>>
> > >>>>> So, any help is appreciated there.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Khaled
> > >>>>
> > >>>>
> >
>

Re: HBase read performance

Posted by Ted Yu <yu...@gmail.com>.
Khaled:
Do you have a profiler such as JProfiler?
Profiling would give us more hints.

Otherwise, capturing a stack trace during the period of the reverse scan would
help.
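
If you cannot attach a profiler, here is a minimal sketch of grabbing the same information without extra tooling. It dumps the threads of the JVM it runs in, so call it from the client process while the slow gather is in flight; on the region server side, plain jstack against the RS pid gives the equivalent:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StackSnapshot {
  // Prints every thread in the current JVM with its full stack, so you can see
  // where the reader threads are blocked during the scatter/gather.
  public static void dump() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
      for (StackTraceElement frame : info.getStackTrace()) {
        System.out.println("    at " + frame);
      }
      System.out.println();
    }
  }
}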

Cheers

On Thu, Oct 2, 2014 at 4:52 PM, lars hofhansl <la...@apache.org> wrote:

> You might have the data in the OS buffer cache, without short circuit
> reading the region server has to request the block from the data node
> process, which then reads it from the block cache.
> That is a few context switches per RPC that do not show up in CPU metrics.
> In that you also would not see disk IO.
>
>
> If - as you say - you see a lot of evicted blocks the data *has* to come
> from the OS. If you do not see disk IO is *has* to come from the OS cache.
> I.e. there's more RAM on your boxes, and you should increase the heap block
> cache.
>
>
> You can measure the context switches with vmstat. Other than that I have
> no suggestion until I reproduce the problem.
> Also check the data locality index of the region server it should be close
> to 100%.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Thursday, October 2, 2014 3:24 PM
> Subject: RE: HBase read performance
>
>
> Lars thanks a lot about all the tips. I'll make sure I cover all of them
> and get back to you. I am not sure they are the bottleneck though as they
> all are about optimizing physical resource usage. As I said, I don't see
> any contended physical resources now. I'll also try to reproduce this
> problem in a simpler environment and pass to you the test program to play
> with.
>
>
> Couple of high level points to make. You are right that my use case is
> kind of a worst case for HBase reads. But, if things go the way you
> described them, there should be tons of disk IO and that should be clearly
> the bottleneck. This is not the case though. That's for the simple reason
> that this is done in a test environment (I am still prototyping), and not a
> lot of data is yet written to HBase. However for the real use case, there
> should writers constantly writing data to HBase and readers occasionally
> doing this scatter/gather. At steady state, things should only get worse
> and all the issues you mentioned should get far more pronounced. At this
> point, one can try to mitigate it using more memory or so. I am not there
> yet as I think I am hitting some software bottleneck, which I don't know
> how to work around.
>
> Khaled
>
>
>
>
>
>
> ----------------------------------------
> > Date: Thu, 2 Oct 2014 14:20:47 -0700
> > From: larsh@apache.org
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > OK... We might need to investigate this.
> > Any chance that you can provide a minimal test program and instruction
> about how to set it up.
> > We can do some profiling then.
> >
> > One thing to note is that with scanning HBase cannot use bloom filters
> to rule out HFiles ahead of time, it needs to look into all of them.
> > So you kind of hit on the absolute worst case:
> > - random reads that do not fit into the block cache
> > - cannot use bloom filters
> >
> >
> > Few more question/comments:
> > - Do you have short circuit reading enabled? If not, you should.
> > - Is your table major compacted? That will reduce the number of files to
> look at.
> > - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled
> by default in HBase, but necessarily in your install of HDFS.
> > - Which version of HDFS are you using as backing filesystem?
> > - If your disk is idle, it means the data fits into the OS buffer cache.
> In turn that means that you increase the heap for the region servers. You
> can also use block encoding (FAST_DIFF) to try to make sure the entire
> working set fits into the cache.
> >
> > - Also try to reduce the block size - although if your overall working
> set does not fit in the heap it won't help much.
> >
> >
> > This is a good section of the book to read through generally (even
> though you might know most of this already):
> http://hbase.apache.org/book.html#perf.configurations
> >
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Khaled Elmeleegy <kd...@hotmail.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Cc:
> > Sent: Thursday, October 2, 2014 11:27 AM
> > Subject: RE: HBase read performance
> >
> > I do see a very brief spike in CPU (user/system), but it's no where near
> 0% idle. It goes from 99% idle down to something like 40% idle for a second
> or so. The thing to note, this is all on a test cluster, so no real load.
> Things are generally idle until i issue 2-3 of these multi-scan-requests to
> render a web page. Then, you see the spike in the cpu and some activity in
> the network and disk, but nowhere near saturation.
> >
> >
> > If there are specific tests you'd like me to do to debug this, I'd be
> more than happy to do it.
> >
> >
> > Khaled
> >
> >
> >
> >
> > ----------------------------------------
> >> Date: Thu, 2 Oct 2014 11:15:59 -0700
> >> From: larsh@apache.org
> >> Subject: Re: HBase read performance
> >> To: user@hbase.apache.org
> >>
> >> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting
> a lot for IO. No high User/System CPU either?
> >>
> >> If you see a lot of evicted block then each RPC has a high chance of
> requiring to bring an entire 64k block in. You'll see bad performance with
> this.
> >>
> >> We might need to trace this specific scenario.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Khaled Elmeleegy <kd...@hotmail.com>
> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> Sent: Thursday, October 2, 2014 10:46 AM
> >> Subject: RE: HBase read performance
> >>
> >>
> >> I've set the heap size to 6GB and I do have gc logging. No long pauses
> there -- occasional 0.1s or 0.2s.
> >>
> >> Other than the discrepancy between what's reported on the client and
> what's reported at the RS, there is also the issue of not getting proper
> concurrency. So, even if a reverse get takes 100ms or so (this has to be
> mostly blocking on various things as no physical resource is contended),
> then the other gets/scans should be able to proceed in parallel, so a
> thousand concurrent gets/scans should finish in few hundreds of ms not many
> seconds. That's why I thought I'd increase the handlers count to try to get
> more concurrency, but it didn't help. So, there must be something else.
> >>
> >> Khaled
> >>
> >> ----------------------------------------
> >>> From: ndimiduk@gmail.com
> >>> Date: Thu, 2 Oct 2014 10:36:39 -0700
> >>> Subject: Re: HBase read performance
> >>> To: user@hbase.apache.org
> >>>
> >>> Do check again on the heap size of the region servers. The default
> >>> unconfigured size is 1G; too small for much of anything. Check your RS
> logs
> >>> -- look for lines produced by the JVMPauseMonitor thread. They usually
> >>> correlate with long GC pauses or other process-freeze events.
> >>>
> >>> Get is implemented as a Scan of a single row, so a reverse scan of a
> single
> >>> row should be functionally equivalent.
> >>>
> >>> In practice, I have seen discrepancy between the latencies reported by
> the
> >>> RS and the latencies experienced by the client. I've not investigated
> this
> >>> area thoroughly.
> >>>
> >>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> >>>
> >>>> Thanks Lars for your quick reply.
> >>>>
> >>>> Yes performance is similar with less handlers (I tried with 100
> first).
> >>>>
> >>>> The payload is not big ~1KB or so. The working set doesn't seem to
> fit in
> >>>> memory as there are many cache misses. However, disk is far from
> being a
> >>>> bottleneck. I checked using iostat. I also verified that neither the
> >>>> network nor the CPU of the region server or the client are a
> bottleneck.
> >>>> This leads me to believe that likely this is a software bottleneck,
> >>>> possibly due to a misconfiguration on my side. I just don't know how
> to
> >>>> debug it. A clear disconnect I see is the individual request latency
> as
> >>>> reported by metrics on the region server (IPC processCallTime vs
> scanNext)
> >>>> vs what's measured on the client. Does this sound right? Any ideas on
> how
> >>>> to better debug it?
> >>>>
> >>>> About this trick with the timestamps to be able to do a forward scan,
> >>>> thanks for pointing it out. Actually, I am aware of it. The problem I
> have
> >>>> is, sometimes I want to get the key after a particular timestamp and
> >>>> sometimes I want to get the key before, so just relying on the key
> order
> >>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan
> can
> >>>> do the trick though.
> >>>>
> >>>> Khaled
> >>>>
> >>>> ----------------------------------------
> >>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> >>>>> From: larsh@apache.org
> >>>>> Subject: Re: HBase read performance
> >>>>> To: user@hbase.apache.org
> >>>>>
> >>>>> Hi Khaled,
> >>>>> is it the same with fewer threads? 1500 handler threads seems to be a
> >>>> lot. Typically a good number of threads depends on the hardware
> (number of
> >>>> cores, number of spindles, etc). I cannot think of any type of
> scenario
> >>>> where more than 100 would give any improvement.
> >>>>>
> >>>>> How large is the payload per KV retrieved that way? If large (as in a
> >>>> few 100k) you definitely want to lower the number of the handler
> threads.
> >>>>> How much heap do you give the region server? Does the working set fit
> >>>> into the cache? (i.e. in the metrics, do you see the eviction count
> going
> >>>> up, if so it does not fit into the cache).
> >>>>>
> >>>>> If the working set does not fit into the cache (eviction count goes
> up)
> >>>> then HBase will need to bring a new block in from disk on each Get
> >>>> (assuming the Gets are more or less random as far as the server is
> >>>> concerned).
> >>>>> In case you'll benefit from reducing the HFile block size (from 64k
> to
> >>>> 8k or even 4k).
> >>>>>
> >>>>> Lastly I don't think we tested the performance of using reverse scan
> >>>> this way, there is probably room to optimize this.
> >>>>> Can you restructure your keys to allow forwards scanning? For example
> >>>> you could store the time as MAX_LONG-time. Or you could invert all
> the bits
> >>>> of the time portion of the key, so that it sort the other way. Then
> you
> >>>> could do a forward scan.
> >>>>>
> >>>>> Let us know how it goes.
> >>>>>
> >>>>> -- Lars
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
> >>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>>>> Cc:
> >>>>> Sent: Thursday, October 2, 2014 12:12 AM
> >>>>> Subject: HBase read performance
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have
> a
> >>>> client reading ~1000 keys from an HBase table. These keys happen to
> fall on
> >>>> the same region server. For my reads I use reverse scan to read each
> key as
> >>>> I want the key prior to a specific time stamp (time stamps are stored
> in
> >>>> reverse order). I don't believe gets can accomplish that, right? so I
> use
> >>>> scan, with caching set to 1.
> >>>>>
> >>>>> I use 2000 reader threads in the client and on HBase, I've set
> >>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
> >>>> gather is very slow and can take up to 10s in total. Timing an
> individual
> >>>> getScanner(..) call on the client side, it can easily take few
> hundreds of
> >>>> ms. I also got the following metrics from the region server in
> question:
> >>>>>
> >>>>> "queueCallTime_mean" : 2.190855525775637,
> >>>>> "queueCallTime_median" : 0.0,
> >>>>> "queueCallTime_75th_percentile" : 0.0,
> >>>>> "queueCallTime_95th_percentile" : 1.0,
> >>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
> >>>>>
> >>>>> "processCallTime_min" : 0,
> >>>>> "processCallTime_max" : 12755,
> >>>>> "processCallTime_mean" : 105.64873440912682,
> >>>>> "processCallTime_median" : 0.0,
> >>>>> "processCallTime_75th_percentile" : 2.0,
> >>>>> "processCallTime_95th_percentile" : 7917.95,
> >>>>> "processCallTime_99th_percentile" : 8876.89,
> >>>>>
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> >>>> : 89,
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> >>>> : 11300,
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> >>>> : 654.4949739797315,
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> >>>> : 101.0,
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> >>>> : 101.0,
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> >>>> : 101.0,
> >>>>>
> >>>>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> >>>> : 113.0,
> >>>>>
> >>>>> Where "delta" is the name of the table I am querying.
> >>>>>
> >>>>> In addition to all this, i monitored the hardware resources (CPU,
> disk,
> >>>> and network) of both the client and the region server and nothing
> seems
> >>>> anywhere near saturation. So I am puzzled by what's going on and
> where this
> >>>> time is going.
> >>>>>
> >>>>> Few things to note based on the above measurements: both medians of
> IPC
> >>>> processCallTime and queueCallTime are basically zero (ms I presume,
> >>>> right?). However, scanNext_median is 101 (ms too, right?). I am not
> sure
> >>>> how this adds up. Also, even though the 101 figure seems outrageously
> high
> >>>> and I don't know why, still all these scans should be happening in
> >>>> parallel, so the overall call should finish fast, given that no
> hardware
> >>>> resource is contended, right? but this is not what's happening, so I
> have
> >>>> to be missing something(s).
> >>>>>
> >>>>> So, any help is appreciated there.
> >>>>>
> >>>>> Thanks,
> >>>>> Khaled
> >>>>
> >>>>
>

Re: HBase read performance

Posted by lars hofhansl <la...@apache.org>.
You might have the data in the OS buffer cache. Without short circuit reading, the region server has to request the block from the data node process, which then reads it from the OS cache.
That is a few context switches per RPC that do not show up in CPU metrics. In that case you also would not see disk IO.


If - as you say - you see a lot of evicted blocks, the data *has* to come from the OS. If you do not see disk IO, it *has* to come from the OS cache. I.e. there's more RAM on your boxes, and you should increase the heap and block cache.


You can measure the context switches with vmstat. Other than that I have no suggestion until I reproduce the problem.
Also check the data locality index of the region server; it should be close to 100%.


-- Lars



________________________________
 From: Khaled Elmeleegy <kd...@hotmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Thursday, October 2, 2014 3:24 PM
Subject: RE: HBase read performance
 

Lars thanks a lot about all the tips. I'll make sure I cover all of them and get back to you. I am not sure they are the bottleneck though as they all are about optimizing physical resource usage. As I said, I don't see any contended physical resources now. I'll also try to reproduce this problem in a simpler environment and pass to you the test program to play with.


Couple of high level points to make. You are right that my use case is kind of a worst case for HBase reads. But, if things go the way you described them, there should be tons of disk IO and that should be clearly the bottleneck. This is not the case though. That's for the simple reason that this is done in a test environment (I am still prototyping), and not a lot of data is yet written to HBase. However for the real use case, there should writers constantly writing data to HBase and readers occasionally doing this scatter/gather. At steady state, things should only get worse and all the issues you mentioned should get far more pronounced. At this point, one can try to mitigate it using more memory or so. I am not there yet as I think I am hitting some software bottleneck, which I don't know how to work around.

Khaled






----------------------------------------
> Date: Thu, 2 Oct 2014 14:20:47 -0700
> From: larsh@apache.org
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> OK... We might need to investigate this.
> Any chance that you can provide a minimal test program and instruction about how to set it up.
> We can do some profiling then.
>
> One thing to note is that with scanning HBase cannot use bloom filters to rule out HFiles ahead of time, it needs to look into all of them.
> So you kind of hit on the absolute worst case:
> - random reads that do not fit into the block cache
> - cannot use bloom filters
>
>
> Few more question/comments:
> - Do you have short circuit reading enabled? If not, you should.
> - Is your table major compacted? That will reduce the number of files to look at.
> - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled by default in HBase, but necessarily in your install of HDFS.
> - Which version of HDFS are you using as backing filesystem?
> - If your disk is idle, it means the data fits into the OS buffer cache. In turn that means that you increase the heap for the region servers. You can also use block encoding (FAST_DIFF) to try to make sure the entire working set fits into the cache.
>
> - Also try to reduce the block size - although if your overall working set does not fit in the heap it won't help much.
>
>
> This is a good section of the book to read through generally (even though you might know most of this already): http://hbase.apache.org/book.html#perf.configurations
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Thursday, October 2, 2014 11:27 AM
> Subject: RE: HBase read performance
>
> I do see a very brief spike in CPU (user/system), but it's no where near 0% idle. It goes from 99% idle down to something like 40% idle for a second or so. The thing to note, this is all on a test cluster, so no real load. Things are generally idle until i issue 2-3 of these multi-scan-requests to render a web page. Then, you see the spike in the cpu and some activity in the network and disk, but nowhere near saturation.
>
>
> If there are specific tests you'd like me to do to debug this, I'd be more than happy to do it.
>
>
> Khaled
>
>
>
>
> ----------------------------------------
>> Date: Thu, 2 Oct 2014 11:15:59 -0700
>> From: larsh@apache.org
>> Subject: Re: HBase read performance
>> To: user@hbase.apache.org
>>
>> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting a lot for IO. No high User/System CPU either?
>>
>> If you see a lot of evicted block then each RPC has a high chance of requiring to bring an entire 64k block in. You'll see bad performance with this.
>>
>> We might need to trace this specific scenario.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>> From: Khaled Elmeleegy <kd...@hotmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Sent: Thursday, October 2, 2014 10:46 AM
>> Subject: RE: HBase read performance
>>
>>
>> I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional 0.1s or 0.2s.
>>
>> Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. So, even if a reverse get takes 100ms or so (this has to be mostly blocking on various things as no physical resource is contended), then the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in few hundreds of ms not many seconds. That's why I thought I'd increase the handlers count to try to get more concurrency, but it didn't help. So, there must be something else.
>>
>> Khaled
>>
>> ----------------------------------------
>>> From: ndimiduk@gmail.com
>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> Do check again on the heap size of the region servers. The default
>>> unconfigured size is 1G; too small for much of anything. Check your RS logs
>>> -- look for lines produced by the JVMPauseMonitor thread. They usually
>>> correlate with long GC pauses or other process-freeze events.
>>>
>>> Get is implemented as a Scan of a single row, so a reverse scan of a single
>>> row should be functionally equivalent.
>>>
>>> In practice, I have seen discrepancy between the latencies reported by the
>>> RS and the latencies experienced by the client. I've not investigated this
>>> area thoroughly.
>>>
>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>>>
>>>> Thanks Lars for your quick reply.
>>>>
>>>> Yes performance is similar with less handlers (I tried with 100 first).
>>>>
>>>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>>>> memory as there are many cache misses. However, disk is far from being a
>>>> bottleneck. I checked using iostat. I also verified that neither the
>>>> network nor the CPU of the region server or the client are a bottleneck.
>>>> This leads me to believe that likely this is a software bottleneck,
>>>> possibly due to a misconfiguration on my side. I just don't know how to
>>>> debug it. A clear disconnect I see is the individual request latency as
>>>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>>>> vs what's measured on the client. Does this sound right? Any ideas on how
>>>> to better debug it?
>>>>
>>>> About this trick with the timestamps to be able to do a forward scan,
>>>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>>>> is, sometimes I want to get the key after a particular timestamp and
>>>> sometimes I want to get the key before, so just relying on the key order
>>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>>>> do the trick though.
>>>>
>>>> Khaled
>>>>
>>>> ----------------------------------------
>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>> From: larsh@apache.org
>>>>> Subject: Re: HBase read performance
>>>>> To: user@hbase.apache.org
>>>>>
>>>>> Hi Khaled,
>>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>>> lot. Typically a good number of threads depends on the hardware (number of
>>>> cores, number of spindles, etc). I cannot think of any type of scenario
>>>> where more than 100 would give any improvement.
>>>>>
>>>>> How large is the payload per KV retrieved that way? If large (as in a
>>>> few 100k) you definitely want to lower the number of the handler threads.
>>>>> How much heap do you give the region server? Does the working set fit
>>>> into the cache? (i.e. in the metrics, do you see the eviction count going
>>>> up, if so it does not fit into the cache).
>>>>>
>>>>> If the working set does not fit into the cache (eviction count goes up)
>>>> then HBase will need to bring a new block in from disk on each Get
>>>> (assuming the Gets are more or less random as far as the server is
>>>> concerned).
>>>>> In case you'll benefit from reducing the HFile block size (from 64k to
>>>> 8k or even 4k).
>>>>>
>>>>> Lastly I don't think we tested the performance of using reverse scan
>>>> this way, there is probably room to optimize this.
>>>>> Can you restructure your keys to allow forwards scanning? For example
>>>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>>>> of the time portion of the key, so that it sort the other way. Then you
>>>> could do a forward scan.
>>>>>
>>>>> Let us know how it goes.
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>> Cc:
>>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>>> Subject: HBase read performance
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>>>> client reading ~1000 keys from an HBase table. These keys happen to fall on
>>>> the same region server. For my reads I use reverse scan to read each key as
>>>> I want the key prior to a specific time stamp (time stamps are stored in
>>>> reverse order). I don't believe gets can accomplish that, right? so I use
>>>> scan, with caching set to 1.
>>>>>
>>>>> I use 2000 reader threads in the client and on HBase, I've set
>>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>>>> gather is very slow and can take up to 10s in total. Timing an individual
>>>> getScanner(..) call on the client side, it can easily take few hundreds of
>>>> ms. I also got the following metrics from the region server in question:
>>>>>
>>>>> "queueCallTime_mean" : 2.190855525775637,
>>>>> "queueCallTime_median" : 0.0,
>>>>> "queueCallTime_75th_percentile" : 0.0,
>>>>> "queueCallTime_95th_percentile" : 1.0,
>>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>>
>>>>> "processCallTime_min" : 0,
>>>>> "processCallTime_max" : 12755,
>>>>> "processCallTime_mean" : 105.64873440912682,
>>>>> "processCallTime_median" : 0.0,
>>>>> "processCallTime_75th_percentile" : 2.0,
>>>>> "processCallTime_95th_percentile" : 7917.95,
>>>>> "processCallTime_99th_percentile" : 8876.89,
>>>>>
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>>> : 89,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>>> : 11300,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>>> : 654.4949739797315,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>>> : 101.0,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>>> : 101.0,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>>> : 101.0,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>>> : 113.0,
>>>>>
>>>>> Where "delta" is the name of the table I am querying.
>>>>>
>>>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>>>> and network) of both the client and the region server and nothing seems
>>>> anywhere near saturation. So I am puzzled by what's going on and where this
>>>> time is going.
>>>>>
>>>>> Few things to note based on the above measurements: both medians of IPC
>>>> processCallTime and queueCallTime are basically zero (ms I presume,
>>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>>> how this adds up. Also, even though the 101 figure seems outrageously high
>>>> and I don't know why, still all these scans should be happening in
>>>> parallel, so the overall call should finish fast, given that no hardware
>>>> resource is contended, right? but this is not what's happening, so I have
>>>> to be missing something(s).
>>>>>
>>>>> So, any help is appreciated there.
>>>>>
>>>>> Thanks,
>>>>> Khaled
>>>>
>>>>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Lars, thanks a lot for all the tips. I'll make sure I cover all of them and get back to you. I am not sure they address the bottleneck though, as they are all about optimizing physical resource usage, and, as I said, I don't see any contended physical resources now. I'll also try to reproduce this problem in a simpler environment and pass you the test program to play with.


A couple of high level points to make. You are right that my use case is kind of a worst case for HBase reads. But, if things went the way you described them, there would be tons of disk IO and that would clearly be the bottleneck. This is not the case though, for the simple reason that this is a test environment (I am still prototyping) and not a lot of data has been written to HBase yet. In the real use case, however, there would be writers constantly writing data to HBase and readers occasionally doing this scatter/gather. At steady state, things should only get worse and all the issues you mentioned should get far more pronounced. At that point, one can try to mitigate them with more memory or so. I am not there yet, as I think I am hitting some software bottleneck which I don't know how to work around.

Khaled



----------------------------------------
> Date: Thu, 2 Oct 2014 14:20:47 -0700
> From: larsh@apache.org
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> OK... We might need to investigate this.
> Any chance that you can provide a minimal test program and instruction about how to set it up.
> We can do some profiling then.
>
> One thing to note is that with scanning HBase cannot use bloom filters to rule out HFiles ahead of time, it needs to look into all of them.
> So you kind of hit on the absolute worst case:
> - random reads that do not fit into the block cache
> - cannot use bloom filters
>
>
> Few more question/comments:
> - Do you have short circuit reading enabled? If not, you should.
> - Is your table major compacted? That will reduce the number of files to look at.
> - Did you disable Nagle's everywhere (enabled tcpnodelay)? It disabled by default in HBase, but necessarily in your install of HDFS.
> - Which version of HDFS are you using as backing filesystem?
> - If your disk is idle, it means the data fits into the OS buffer cache. In turn that means that you increase the heap for the region servers. You can also use block encoding (FAST_DIFF) to try to make sure the entire working set fits into the cache.
>
> - Also try to reduce the block size - although if your overall working set does not fit in the heap it won't help much.
>
>
> This is a good section of the book to read through generally (even though you might know most of this already): http://hbase.apache.org/book.html#perf.configurations
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Thursday, October 2, 2014 11:27 AM
> Subject: RE: HBase read performance
>
> I do see a very brief spike in CPU (user/system), but it's no where near 0% idle. It goes from 99% idle down to something like 40% idle for a second or so. The thing to note, this is all on a test cluster, so no real load. Things are generally idle until i issue 2-3 of these multi-scan-requests to render a web page. Then, you see the spike in the cpu and some activity in the network and disk, but nowhere near saturation.
>
>
> If there are specific tests you'd like me to do to debug this, I'd be more than happy to do it.
>
>
> Khaled
>
>
>
>
> ----------------------------------------
>> Date: Thu, 2 Oct 2014 11:15:59 -0700
>> From: larsh@apache.org
>> Subject: Re: HBase read performance
>> To: user@hbase.apache.org
>>
>> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting a lot for IO. No high User/System CPU either?
>>
>> If you see a lot of evicted block then each RPC has a high chance of requiring to bring an entire 64k block in. You'll see bad performance with this.
>>
>> We might need to trace this specific scenario.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>> From: Khaled Elmeleegy <kd...@hotmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Sent: Thursday, October 2, 2014 10:46 AM
>> Subject: RE: HBase read performance
>>
>>
>> I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional 0.1s or 0.2s.
>>
>> Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. So, even if a reverse get takes 100ms or so (this has to be mostly blocking on various things as no physical resource is contended), then the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in few hundreds of ms not many seconds. That's why I thought I'd increase the handlers count to try to get more concurrency, but it didn't help. So, there must be something else.
>>
>> Khaled
>>
>> ----------------------------------------
>>> From: ndimiduk@gmail.com
>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> Do check again on the heap size of the region servers. The default
>>> unconfigured size is 1G; too small for much of anything. Check your RS logs
>>> -- look for lines produced by the JVMPauseMonitor thread. They usually
>>> correlate with long GC pauses or other process-freeze events.
>>>
>>> Get is implemented as a Scan of a single row, so a reverse scan of a single
>>> row should be functionally equivalent.
>>>
>>> In practice, I have seen discrepancy between the latencies reported by the
>>> RS and the latencies experienced by the client. I've not investigated this
>>> area thoroughly.
>>>
>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>>>
>>>> Thanks Lars for your quick reply.
>>>>
>>>> Yes performance is similar with less handlers (I tried with 100 first).
>>>>
>>>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>>>> memory as there are many cache misses. However, disk is far from being a
>>>> bottleneck. I checked using iostat. I also verified that neither the
>>>> network nor the CPU of the region server or the client are a bottleneck.
>>>> This leads me to believe that likely this is a software bottleneck,
>>>> possibly due to a misconfiguration on my side. I just don't know how to
>>>> debug it. A clear disconnect I see is the individual request latency as
>>>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>>>> vs what's measured on the client. Does this sound right? Any ideas on how
>>>> to better debug it?
>>>>
>>>> About this trick with the timestamps to be able to do a forward scan,
>>>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>>>> is, sometimes I want to get the key after a particular timestamp and
>>>> sometimes I want to get the key before, so just relying on the key order
>>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>>>> do the trick though.
>>>>
>>>> Khaled
>>>>
>>>> ----------------------------------------
>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>> From: larsh@apache.org
>>>>> Subject: Re: HBase read performance
>>>>> To: user@hbase.apache.org
>>>>>
>>>>> Hi Khaled,
>>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>>> lot. Typically a good number of threads depends on the hardware (number of
>>>> cores, number of spindles, etc). I cannot think of any type of scenario
>>>> where more than 100 would give any improvement.
>>>>>
>>>>> How large is the payload per KV retrieved that way? If large (as in a
>>>> few 100k) you definitely want to lower the number of the handler threads.
>>>>> How much heap do you give the region server? Does the working set fit
>>>> into the cache? (i.e. in the metrics, do you see the eviction count going
>>>> up, if so it does not fit into the cache).
>>>>>
>>>>> If the working set does not fit into the cache (eviction count goes up)
>>>> then HBase will need to bring a new block in from disk on each Get
>>>> (assuming the Gets are more or less random as far as the server is
>>>> concerned).
>>>>> In case you'll benefit from reducing the HFile block size (from 64k to
>>>> 8k or even 4k).
>>>>>
>>>>> Lastly I don't think we tested the performance of using reverse scan
>>>> this way, there is probably room to optimize this.
>>>>> Can you restructure your keys to allow forwards scanning? For example
>>>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>>>> of the time portion of the key, so that it sort the other way. Then you
>>>> could do a forward scan.
>>>>>
>>>>> Let us know how it goes.
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>> Cc:
>>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>>> Subject: HBase read performance
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>>>> client reading ~1000 keys from an HBase table. These keys happen to fall on
>>>> the same region server. For my reads I use reverse scan to read each key as
>>>> I want the key prior to a specific time stamp (time stamps are stored in
>>>> reverse order). I don't believe gets can accomplish that, right? so I use
>>>> scan, with caching set to 1.
>>>>>
>>>>> I use 2000 reader threads in the client and on HBase, I've set
>>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>>>> gather is very slow and can take up to 10s in total. Timing an individual
>>>> getScanner(..) call on the client side, it can easily take few hundreds of
>>>> ms. I also got the following metrics from the region server in question:
>>>>>
>>>>> "queueCallTime_mean" : 2.190855525775637,
>>>>> "queueCallTime_median" : 0.0,
>>>>> "queueCallTime_75th_percentile" : 0.0,
>>>>> "queueCallTime_95th_percentile" : 1.0,
>>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>>
>>>>> "processCallTime_min" : 0,
>>>>> "processCallTime_max" : 12755,
>>>>> "processCallTime_mean" : 105.64873440912682,
>>>>> "processCallTime_median" : 0.0,
>>>>> "processCallTime_75th_percentile" : 2.0,
>>>>> "processCallTime_95th_percentile" : 7917.95,
>>>>> "processCallTime_99th_percentile" : 8876.89,
>>>>>
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>>> : 89,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>>> : 11300,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>>> : 654.4949739797315,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>>> : 101.0,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>>> : 101.0,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>>> : 101.0,
>>>>>
>>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>>> : 113.0,
>>>>>
>>>>> Where "delta" is the name of the table I am querying.
>>>>>
>>>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>>>> and network) of both the client and the region server and nothing seems
>>>> anywhere near saturation. So I am puzzled by what's going on and where this
>>>> time is going.
>>>>>
>>>>> Few things to note based on the above measurements: both medians of IPC
>>>> processCallTime and queueCallTime are basically zero (ms I presume,
>>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>>> how this adds up. Also, even though the 101 figure seems outrageously high
>>>> and I don't know why, still all these scans should be happening in
>>>> parallel, so the overall call should finish fast, given that no hardware
>>>> resource is contended, right? but this is not what's happening, so I have
>>>> to be missing something(s).
>>>>>
>>>>> So, any help is appreciated there.
>>>>>
>>>>> Thanks,
>>>>> Khaled
>>>>
>>>>

Re: HBase read performance

Posted by lars hofhansl <la...@apache.org>.
OK... We might need to investigate this.
Any chance you can provide a minimal test program and instructions on how to set it up?
We can do some profiling then.

One thing to note is that with scanning HBase cannot use bloom filters to rule out HFiles ahead of time, it needs to look into all of them.
So you kind of hit on the absolute worst case:
- random reads that do not fit into the block cache
- cannot use bloom filters


Few more question/comments:
- Do you have short circuit reading enabled? If not, you should.
- Is your table major compacted? That will reduce the number of files to look at.
- Did you disable Nagle's algorithm everywhere (i.e. enabled tcpnodelay)? It is disabled by default in HBase, but not necessarily in your install of HDFS.
- Which version of HDFS are you using as backing filesystem?
- If your disk is idle, it means the data fits into the OS buffer cache. In turn that means you can increase the heap for the region servers. You can also use block encoding (FAST_DIFF) to try to make sure the entire working set fits into the cache (see the sketch below).

- Also try to reduce the block size - although if your overall working set does not fit in the heap it won't help much.


This is a good section of the book to read through generally (even though you might know most of this already): http://hbase.apache.org/book.html#perf.configurations
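As a rough sketch of how the encoding/block-size/bloom-filter suggestions above could be applied with the 0.98 Java admin API -- the column family name "d" is a placeholder for the actual schema, and in practice you would start from the existing HColumnDescriptor rather than build a fresh one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class TuneDeltaTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      TableName table = TableName.valueOf("delta");        // table from this thread
      // Hypothetical family name; ideally fetch the existing descriptor via
      // admin.getTableDescriptor(table).getFamily(...) so other settings survive.
      HColumnDescriptor cf = new HColumnDescriptor("d");
      cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // denser blocks in cache
      cf.setBlocksize(8 * 1024);                            // 8k instead of the 64k default
      cf.setBloomFilterType(BloomType.ROW);                 // helps point Gets, not scans
      admin.disableTable(table);                            // conservative offline schema change
      admin.modifyColumn(table, cf);
      admin.enableTable(table);
      admin.majorCompact("delta");                          // rewrite HFiles with the new settings
    } finally {
      admin.close();
    }
  }
}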


-- Lars



----- Original Message -----
From: Khaled Elmeleegy <kd...@hotmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Thursday, October 2, 2014 11:27 AM
Subject: RE: HBase read performance

I do see a very brief spike in CPU (user/system), but it's no where near 0% idle. It goes from 99% idle down to something like 40% idle for a second or so. The thing to note, this is all on a test cluster, so no real load. Things are generally idle until i issue 2-3 of these multi-scan-requests to render a web page. Then, you see the spike in the cpu and some activity in the network and disk, but nowhere near saturation.


If there are specific tests you'd like me to do to debug this, I'd be more than happy to do it.


Khaled




----------------------------------------
> Date: Thu, 2 Oct 2014 11:15:59 -0700
> From: larsh@apache.org
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting a lot for IO. No high User/System CPU either?
>
> If you see a lot of evicted block then each RPC has a high chance of requiring to bring an entire 64k block in. You'll see bad performance with this.
>
> We might need to trace this specific scenario.
>
> -- Lars
>
>
>
> ________________________________
> From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Thursday, October 2, 2014 10:46 AM
> Subject: RE: HBase read performance
>
>
> I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional 0.1s or 0.2s.
>
> Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. So, even if a reverse get takes 100ms or so (this has to be mostly blocking on various things as no physical resource is contended), then the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in few hundreds of ms not many seconds. That's why I thought I'd increase the handlers count to try to get more concurrency, but it didn't help. So, there must be something else.
>
> Khaled
>
> ----------------------------------------
>> From: ndimiduk@gmail.com
>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>> Subject: Re: HBase read performance
>> To: user@hbase.apache.org
>>
>> Do check again on the heap size of the region servers. The default
>> unconfigured size is 1G; too small for much of anything. Check your RS logs
>> -- look for lines produced by the JVMPauseMonitor thread. They usually
>> correlate with long GC pauses or other process-freeze events.
>>
>> Get is implemented as a Scan of a single row, so a reverse scan of a single
>> row should be functionally equivalent.
>>
>> In practice, I have seen discrepancy between the latencies reported by the
>> RS and the latencies experienced by the client. I've not investigated this
>> area thoroughly.
>>
>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>>
>>> Thanks Lars for your quick reply.
>>>
>>> Yes performance is similar with less handlers (I tried with 100 first).
>>>
>>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>>> memory as there are many cache misses. However, disk is far from being a
>>> bottleneck. I checked using iostat. I also verified that neither the
>>> network nor the CPU of the region server or the client are a bottleneck.
>>> This leads me to believe that likely this is a software bottleneck,
>>> possibly due to a misconfiguration on my side. I just don't know how to
>>> debug it. A clear disconnect I see is the individual request latency as
>>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>>> vs what's measured on the client. Does this sound right? Any ideas on how
>>> to better debug it?
>>>
>>> About this trick with the timestamps to be able to do a forward scan,
>>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>>> is, sometimes I want to get the key after a particular timestamp and
>>> sometimes I want to get the key before, so just relying on the key order
>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>>> do the trick though.
>>>
>>> Khaled
>>>
>>> ----------------------------------------
>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>> From: larsh@apache.org
>>>> Subject: Re: HBase read performance
>>>> To: user@hbase.apache.org
>>>>
>>>> Hi Khaled,
>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>> lot. Typically a good number of threads depends on the hardware (number of
>>> cores, number of spindles, etc). I cannot think of any type of scenario
>>> where more than 100 would give any improvement.
>>>>
>>>> How large is the payload per KV retrieved that way? If large (as in a
>>> few 100k) you definitely want to lower the number of the handler threads.
>>>> How much heap do you give the region server? Does the working set fit
>>> into the cache? (i.e. in the metrics, do you see the eviction count going
>>> up, if so it does not fit into the cache).
>>>>
>>>> If the working set does not fit into the cache (eviction count goes up)
>>> then HBase will need to bring a new block in from disk on each Get
>>> (assuming the Gets are more or less random as far as the server is
>>> concerned).
>>>> In case you'll benefit from reducing the HFile block size (from 64k to
>>> 8k or even 4k).
>>>>
>>>> Lastly I don't think we tested the performance of using reverse scan
>>> this way, there is probably room to optimize this.
>>>> Can you restructure your keys to allow forwards scanning? For example
>>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>>> of the time portion of the key, so that it sort the other way. Then you
>>> could do a forward scan.
>>>>
>>>> Let us know how it goes.
>>>>
>>>> -- Lars
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>> Cc:
>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>> Subject: HBase read performance
>>>>
>>>> Hi,
>>>>
>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>>> client reading ~1000 keys from an HBase table. These keys happen to fall on
>>> the same region server. For my reads I use reverse scan to read each key as
>>> I want the key prior to a specific time stamp (time stamps are stored in
>>> reverse order). I don't believe gets can accomplish that, right? so I use
>>> scan, with caching set to 1.
>>>>
>>>> I use 2000 reader threads in the client and on HBase, I've set
>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>>> gather is very slow and can take up to 10s in total. Timing an individual
>>> getScanner(..) call on the client side, it can easily take few hundreds of
>>> ms. I also got the following metrics from the region server in question:
>>>>
>>>> "queueCallTime_mean" : 2.190855525775637,
>>>> "queueCallTime_median" : 0.0,
>>>> "queueCallTime_75th_percentile" : 0.0,
>>>> "queueCallTime_95th_percentile" : 1.0,
>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>
>>>> "processCallTime_min" : 0,
>>>> "processCallTime_max" : 12755,
>>>> "processCallTime_mean" : 105.64873440912682,
>>>> "processCallTime_median" : 0.0,
>>>> "processCallTime_75th_percentile" : 2.0,
>>>> "processCallTime_95th_percentile" : 7917.95,
>>>> "processCallTime_99th_percentile" : 8876.89,
>>>>
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>> : 89,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>> : 11300,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>> : 654.4949739797315,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>> : 101.0,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>> : 101.0,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>> : 101.0,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>> : 113.0,
>>>>
>>>> Where "delta" is the name of the table I am querying.
>>>>
>>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>>> and network) of both the client and the region server and nothing seems
>>> anywhere near saturation. So I am puzzled by what's going on and where this
>>> time is going.
>>>>
>>>> Few things to note based on the above measurements: both medians of IPC
>>> processCallTime and queueCallTime are basically zero (ms I presume,
>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>> how this adds up. Also, even though the 101 figure seems outrageously high
>>> and I don't know why, still all these scans should be happening in
>>> parallel, so the overall call should finish fast, given that no hardware
>>> resource is contended, right? but this is not what's happening, so I have
>>> to be missing something(s).
>>>>
>>>> So, any help is appreciated there.
>>>>
>>>> Thanks,
>>>> Khaled
>>>
>>>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
I do see a very brief spike in CPU (user/system), but it's nowhere near 0% idle. It goes from 99% idle down to something like 40% idle for a second or so. The thing to note is that this is all on a test cluster, so there is no real load. Things are generally idle until I issue 2-3 of these multi-scan requests to render a web page. Then you see the spike in the CPU and some activity on the network and disk, but nowhere near saturation.


If there are specific tests you'd like me to do to debug this, I'd be more than happy to do it.


Khaled

----------------------------------------
> Date: Thu, 2 Oct 2014 11:15:59 -0700
> From: larsh@apache.org
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> I still think you're waiting on disk. No IOWAIT? So CPU is not waiting a lot for IO. No high User/System CPU either?
>
> If you see a lot of evicted block then each RPC has a high chance of requiring to bring an entire 64k block in. You'll see bad performance with this.
>
> We might need to trace this specific scenario.
>
> -- Lars
>
>
>
> ________________________________
> From: Khaled Elmeleegy <kd...@hotmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Thursday, October 2, 2014 10:46 AM
> Subject: RE: HBase read performance
>
>
> I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional 0.1s or 0.2s.
>
> Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. So, even if a reverse get takes 100ms or so (this has to be mostly blocking on various things as no physical resource is contended), then the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in few hundreds of ms not many seconds. That's why I thought I'd increase the handlers count to try to get more concurrency, but it didn't help. So, there must be something else.
>
> Khaled
>
> ----------------------------------------
>> From: ndimiduk@gmail.com
>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>> Subject: Re: HBase read performance
>> To: user@hbase.apache.org
>>
>> Do check again on the heap size of the region servers. The default
>> unconfigured size is 1G; too small for much of anything. Check your RS logs
>> -- look for lines produced by the JVMPauseMonitor thread. They usually
>> correlate with long GC pauses or other process-freeze events.
>>
>> Get is implemented as a Scan of a single row, so a reverse scan of a single
>> row should be functionally equivalent.
>>
>> In practice, I have seen discrepancy between the latencies reported by the
>> RS and the latencies experienced by the client. I've not investigated this
>> area thoroughly.
>>
>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>>
>>> Thanks Lars for your quick reply.
>>>
>>> Yes performance is similar with less handlers (I tried with 100 first).
>>>
>>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>>> memory as there are many cache misses. However, disk is far from being a
>>> bottleneck. I checked using iostat. I also verified that neither the
>>> network nor the CPU of the region server or the client are a bottleneck.
>>> This leads me to believe that likely this is a software bottleneck,
>>> possibly due to a misconfiguration on my side. I just don't know how to
>>> debug it. A clear disconnect I see is the individual request latency as
>>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>>> vs what's measured on the client. Does this sound right? Any ideas on how
>>> to better debug it?
>>>
>>> About this trick with the timestamps to be able to do a forward scan,
>>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>>> is, sometimes I want to get the key after a particular timestamp and
>>> sometimes I want to get the key before, so just relying on the key order
>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>>> do the trick though.
>>>
>>> Khaled
>>>
>>> ----------------------------------------
>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>> From: larsh@apache.org
>>>> Subject: Re: HBase read performance
>>>> To: user@hbase.apache.org
>>>>
>>>> Hi Khaled,
>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>> lot. Typically a good number of threads depends on the hardware (number of
>>> cores, number of spindles, etc). I cannot think of any type of scenario
>>> where more than 100 would give any improvement.
>>>>
>>>> How large is the payload per KV retrieved that way? If large (as in a
>>> few 100k) you definitely want to lower the number of the handler threads.
>>>> How much heap do you give the region server? Does the working set fit
>>> into the cache? (i.e. in the metrics, do you see the eviction count going
>>> up, if so it does not fit into the cache).
>>>>
>>>> If the working set does not fit into the cache (eviction count goes up)
>>> then HBase will need to bring a new block in from disk on each Get
>>> (assuming the Gets are more or less random as far as the server is
>>> concerned).
>>>> In case you'll benefit from reducing the HFile block size (from 64k to
>>> 8k or even 4k).
>>>>
>>>> Lastly I don't think we tested the performance of using reverse scan
>>> this way, there is probably room to optimize this.
>>>> Can you restructure your keys to allow forwards scanning? For example
>>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>>> of the time portion of the key, so that it sort the other way. Then you
>>> could do a forward scan.
>>>>
>>>> Let us know how it goes.
>>>>
>>>> -- Lars
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>> Cc:
>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>> Subject: HBase read performance
>>>>
>>>> Hi,
>>>>
>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>>> client reading ~1000 keys from an HBase table. These keys happen to fall on
>>> the same region server. For my reads I use reverse scan to read each key as
>>> I want the key prior to a specific time stamp (time stamps are stored in
>>> reverse order). I don't believe gets can accomplish that, right? so I use
>>> scan, with caching set to 1.
>>>>
>>>> I use 2000 reader threads in the client and on HBase, I've set
>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>>> gather is very slow and can take up to 10s in total. Timing an individual
>>> getScanner(..) call on the client side, it can easily take few hundreds of
>>> ms. I also got the following metrics from the region server in question:
>>>>
>>>> "queueCallTime_mean" : 2.190855525775637,
>>>> "queueCallTime_median" : 0.0,
>>>> "queueCallTime_75th_percentile" : 0.0,
>>>> "queueCallTime_95th_percentile" : 1.0,
>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>
>>>> "processCallTime_min" : 0,
>>>> "processCallTime_max" : 12755,
>>>> "processCallTime_mean" : 105.64873440912682,
>>>> "processCallTime_median" : 0.0,
>>>> "processCallTime_75th_percentile" : 2.0,
>>>> "processCallTime_95th_percentile" : 7917.95,
>>>> "processCallTime_99th_percentile" : 8876.89,
>>>>
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>> : 89,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>> : 11300,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>> : 654.4949739797315,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>> : 101.0,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>> : 101.0,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>> : 101.0,
>>>>
>>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>> : 113.0,
>>>>
>>>> Where "delta" is the name of the table I am querying.
>>>>
>>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>>> and network) of both the client and the region server and nothing seems
>>> anywhere near saturation. So I am puzzled by what's going on and where this
>>> time is going.
>>>>
>>>> Few things to note based on the above measurements: both medians of IPC
>>> processCallTime and queueCallTime are basically zero (ms I presume,
>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>> how this adds up. Also, even though the 101 figure seems outrageously high
>>> and I don't know why, still all these scans should be happening in
>>> parallel, so the overall call should finish fast, given that no hardware
>>> resource is contended, right? but this is not what's happening, so I have
>>> to be missing something(s).
>>>>
>>>> So, any help is appreciated there.
>>>>
>>>> Thanks,
>>>> Khaled
>>>
>>>

Re: HBase read performance

Posted by lars hofhansl <la...@apache.org>.
I still think you're waiting on disk. No IOWAIT? So CPU is not waiting a lot for IO. No high User/System CPU either?

If you see a lot of evicted blocks, then each RPC has a high chance of having to bring an entire 64k block in. You'll see bad performance with this.

We might need to trace this specific scenario.

-- Lars



________________________________
 From: Khaled Elmeleegy <kd...@hotmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Thursday, October 2, 2014 10:46 AM
Subject: RE: HBase read performance
 

I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional 0.1s or 0.2s.

Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. So, even if a reverse get takes 100ms or so (this has to be mostly blocking on various things as no physical resource is contended), then the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in few hundreds of ms not many seconds. That's why I thought I'd increase the handlers count to try to get more concurrency, but it didn't help. So, there must be something else.

Khaled

----------------------------------------
> From: ndimiduk@gmail.com
> Date: Thu, 2 Oct 2014 10:36:39 -0700
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> Do check again on the heap size of the region servers. The default
> unconfigured size is 1G; too small for much of anything. Check your RS logs
> -- look for lines produced by the JVMPauseMonitor thread. They usually
> correlate with long GC pauses or other process-freeze events.
>
> Get is implemented as a Scan of a single row, so a reverse scan of a single
> row should be functionally equivalent.
>
> In practice, I have seen discrepancy between the latencies reported by the
> RS and the latencies experienced by the client. I've not investigated this
> area thoroughly.
>
> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>
>> Thanks Lars for your quick reply.
>>
>> Yes performance is similar with less handlers (I tried with 100 first).
>>
>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>> memory as there are many cache misses. However, disk is far from being a
>> bottleneck. I checked using iostat. I also verified that neither the
>> network nor the CPU of the region server or the client are a bottleneck.
>> This leads me to believe that likely this is a software bottleneck,
>> possibly due to a misconfiguration on my side. I just don't know how to
>> debug it. A clear disconnect I see is the individual request latency as
>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>> vs what's measured on the client. Does this sound right? Any ideas on how
>> to better debug it?
>>
>> About this trick with the timestamps to be able to do a forward scan,
>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>> is, sometimes I want to get the key after a particular timestamp and
>> sometimes I want to get the key before, so just relying on the key order
>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>> do the trick though.
>>
>> Khaled
>>
>> ----------------------------------------
>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>> From: larsh@apache.org
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> Hi Khaled,
>>> is it the same with fewer threads? 1500 handler threads seems to be a
>> lot. Typically a good number of threads depends on the hardware (number of
>> cores, number of spindles, etc). I cannot think of any type of scenario
>> where more than 100 would give any improvement.
>>>
>>> How large is the payload per KV retrieved that way? If large (as in a
>> few 100k) you definitely want to lower the number of the handler threads.
>>> How much heap do you give the region server? Does the working set fit
>> into the cache? (i.e. in the metrics, do you see the eviction count going
>> up, if so it does not fit into the cache).
>>>
>>> If the working set does not fit into the cache (eviction count goes up)
>> then HBase will need to bring a new block in from disk on each Get
>> (assuming the Gets are more or less random as far as the server is
>> concerned).
>>> In case you'll benefit from reducing the HFile block size (from 64k to
>> 8k or even 4k).
>>>
>>> Lastly I don't think we tested the performance of using reverse scan
>> this way, there is probably room to optimize this.
>>> Can you restructure your keys to allow forwards scanning? For example
>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>> of the time portion of the key, so that it sort the other way. Then you
>> could do a forward scan.
>>>
>>> Let us know how it goes.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>> Cc:
>>> Sent: Thursday, October 2, 2014 12:12 AM
>>> Subject: HBase read performance
>>>
>>> Hi,
>>>
>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>> client reading ~1000 keys from an HBase table. These keys happen to fall on
>> the same region server. For my reads I use reverse scan to read each key as
>> I want the key prior to a specific time stamp (time stamps are stored in
>> reverse order). I don't believe gets can accomplish that, right? so I use
>> scan, with caching set to 1.
>>>
>>> I use 2000 reader threads in the client and on HBase, I've set
>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>> gather is very slow and can take up to 10s in total. Timing an individual
>> getScanner(..) call on the client side, it can easily take few hundreds of
>> ms. I also got the following metrics from the region server in question:
>>>
>>> "queueCallTime_mean" : 2.190855525775637,
>>> "queueCallTime_median" : 0.0,
>>> "queueCallTime_75th_percentile" : 0.0,
>>> "queueCallTime_95th_percentile" : 1.0,
>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>
>>> "processCallTime_min" : 0,
>>> "processCallTime_max" : 12755,
>>> "processCallTime_mean" : 105.64873440912682,
>>> "processCallTime_median" : 0.0,
>>> "processCallTime_75th_percentile" : 2.0,
>>> "processCallTime_95th_percentile" : 7917.95,
>>> "processCallTime_99th_percentile" : 8876.89,
>>>
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>> : 89,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>> : 11300,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>> : 654.4949739797315,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>> : 101.0,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>> : 101.0,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>> : 101.0,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>> : 113.0,
>>>
>>> Where "delta" is the name of the table I am querying.
>>>
>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>> and network) of both the client and the region server and nothing seems
>> anywhere near saturation. So I am puzzled by what's going on and where this
>> time is going.
>>>
>>> Few things to note based on the above measurements: both medians of IPC
>> processCallTime and queueCallTime are basically zero (ms I presume,
>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>> how this adds up. Also, even though the 101 figure seems outrageously high
>> and I don't know why, still all these scans should be happening in
>> parallel, so the overall call should finish fast, given that no hardware
>> resource is contended, right? but this is not what's happening, so I have
>> to be missing something(s).
>>>
>>> So, any help is appreciated there.
>>>
>>> Thanks,
>>> Khaled
>>
>>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
No, I was using reverse scan to mimic the behavior of reverse get.
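For reference, a minimal sketch of that pattern against the 0.98 client API -- a reversed Scan started at the probe key so the first next() returns the closest row at or before it. The probe key below is hypothetical; "delta" is the table from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseGet {
  // Returns the row with the largest key <= probeKey, or null if none exists.
  public static Result reverseGet(HTable table, byte[] probeKey) throws Exception {
    Scan scan = new Scan(probeKey);   // start at the probe key...
    scan.setReversed(true);           // ...and walk backwards through the keyspace
    scan.setCaching(1);               // only the single closest row is wanted
    ResultScanner scanner = table.getScanner(scan);
    try {
      return scanner.next();
    } finally {
      scanner.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "delta");
    Result r = reverseGet(table, Bytes.toBytes("someKey#9999999999")); // hypothetical key
    System.out.println(r == null ? "no row" : Bytes.toString(r.getRow()));
    table.close();
  }
}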

----------------------------------------
> Date: Thu, 2 Oct 2014 10:50:15 -0700
> Subject: Re: HBase read performance
> From: yuzhihong@gmail.com
> To: user@hbase.apache.org
>
> Khaled:
> Were you using this method from HTable ?
>
> public Result[] get(List<Get> gets) throws IOException {
>
> Cheers
>
> On Thu, Oct 2, 2014 at 10:46 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>
>> I've set the heap size to 6GB and I do have gc logging. No long pauses
>> there -- occasional 0.1s or 0.2s.
>>
>> Other than the discrepancy between what's reported on the client and
>> what's reported at the RS, there is also the issue of not getting proper
>> concurrency. So, even if a reverse get takes 100ms or so (this has to be
>> mostly blocking on various things as no physical resource is contended),
>> then the other gets/scans should be able to proceed in parallel, so a
>> thousand concurrent gets/scans should finish in few hundreds of ms not many
>> seconds. That's why I thought I'd increase the handlers count to try to get
>> more concurrency, but it didn't help. So, there must be something else.
>>
>> Khaled
>>
>> ----------------------------------------
>>> From: ndimiduk@gmail.com
>>> Date: Thu, 2 Oct 2014 10:36:39 -0700
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> Do check again on the heap size of the region servers. The default
>>> unconfigured size is 1G; too small for much of anything. Check your RS
>> logs
>>> -- look for lines produced by the JVMPauseMonitor thread. They usually
>>> correlate with long GC pauses or other process-freeze events.
>>>
>>> Get is implemented as a Scan of a single row, so a reverse scan of a
>> single
>>> row should be functionally equivalent.
>>>
>>> In practice, I have seen discrepancy between the latencies reported by
>> the
>>> RS and the latencies experienced by the client. I've not investigated
>> this
>>> area thoroughly.
>>>
>>> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com>
>> wrote:
>>>
>>>> Thanks Lars for your quick reply.
>>>>
>>>> Yes performance is similar with less handlers (I tried with 100 first).
>>>>
>>>> The payload is not big ~1KB or so. The working set doesn't seem to fit
>> in
>>>> memory as there are many cache misses. However, disk is far from being a
>>>> bottleneck. I checked using iostat. I also verified that neither the
>>>> network nor the CPU of the region server or the client are a bottleneck.
>>>> This leads me to believe that likely this is a software bottleneck,
>>>> possibly due to a misconfiguration on my side. I just don't know how to
>>>> debug it. A clear disconnect I see is the individual request latency as
>>>> reported by metrics on the region server (IPC processCallTime vs
>> scanNext)
>>>> vs what's measured on the client. Does this sound right? Any ideas on
>> how
>>>> to better debug it?
>>>>
>>>> About this trick with the timestamps to be able to do a forward scan,
>>>> thanks for pointing it out. Actually, I am aware of it. The problem I
>> have
>>>> is, sometimes I want to get the key after a particular timestamp and
>>>> sometimes I want to get the key before, so just relying on the key order
>>>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan
>> can
>>>> do the trick though.
>>>>
>>>> Khaled
>>>>
>>>> ----------------------------------------
>>>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>>>> From: larsh@apache.org
>>>>> Subject: Re: HBase read performance
>>>>> To: user@hbase.apache.org
>>>>>
>>>>> Hi Khaled,
>>>>> is it the same with fewer threads? 1500 handler threads seems to be a
>>>> lot. Typically a good number of threads depends on the hardware (number
>> of
>>>> cores, number of spindles, etc). I cannot think of any type of scenario
>>>> where more than 100 would give any improvement.
>>>>>
>>>>> How large is the payload per KV retrieved that way? If large (as in a
>>>> few 100k) you definitely want to lower the number of the handler
>> threads.
>>>>> How much heap do you give the region server? Does the working set fit
>>>> into the cache? (i.e. in the metrics, do you see the eviction count
>> going
>>>> up, if so it does not fit into the cache).
>>>>>
>>>>> If the working set does not fit into the cache (eviction count goes up)
>>>> then HBase will need to bring a new block in from disk on each Get
>>>> (assuming the Gets are more or less random as far as the server is
>>>> concerned).
>>>>> In case you'll benefit from reducing the HFile block size (from 64k to
>>>> 8k or even 4k).
>>>>>
>>>>> Lastly I don't think we tested the performance of using reverse scan
>>>> this way, there is probably room to optimize this.
>>>>> Can you restructure your keys to allow forwards scanning? For example
>>>> you could store the time as MAX_LONG-time. Or you could invert all the
>> bits
>>>> of the time portion of the key, so that it sort the other way. Then you
>>>> could do a forward scan.
>>>>>
>>>>> Let us know how it goes.
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>> Cc:
>>>>> Sent: Thursday, October 2, 2014 12:12 AM
>>>>> Subject: HBase read performance
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>>>> client reading ~1000 keys from an HBase table. These keys happen to
>> fall on
>>>> the same region server. For my reads I use reverse scan to read each
>> key as
>>>> I want the key prior to a specific time stamp (time stamps are stored in
>>>> reverse order). I don't believe gets can accomplish that, right? so I
>> use
>>>> scan, with caching set to 1.
>>>>>
>>>>> I use 2000 reader threads in the client and on HBase, I've set
>>>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>>>> gather is very slow and can take up to 10s in total. Timing an
>> individual
>>>> getScanner(..) call on the client side, it can easily take few hundreds
>> of
>>>> ms. I also got the following metrics from the region server in question:
>>>>>
>>>>> "queueCallTime_mean" : 2.190855525775637,
>>>>> "queueCallTime_median" : 0.0,
>>>>> "queueCallTime_75th_percentile" : 0.0,
>>>>> "queueCallTime_95th_percentile" : 1.0,
>>>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>>>
>>>>> "processCallTime_min" : 0,
>>>>> "processCallTime_max" : 12755,
>>>>> "processCallTime_mean" : 105.64873440912682,
>>>>> "processCallTime_median" : 0.0,
>>>>> "processCallTime_75th_percentile" : 2.0,
>>>>> "processCallTime_95th_percentile" : 7917.95,
>>>>> "processCallTime_99th_percentile" : 8876.89,
>>>>>
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>>>> : 89,
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>>>> : 11300,
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>>>> : 654.4949739797315,
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>>>> : 101.0,
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>>>> : 101.0,
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>>>> : 101.0,
>>>>>
>>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>>>> : 113.0,
>>>>>
>>>>> Where "delta" is the name of the table I am querying.
>>>>>
>>>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>>>> and network) of both the client and the region server and nothing seems
>>>> anywhere near saturation. So I am puzzled by what's going on and where
>> this
>>>> time is going.
>>>>>
>>>>> Few things to note based on the above measurements: both medians of IPC
>>>> processCallTime and queueCallTime are basically zero (ms I presume,
>>>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>>>> how this adds up. Also, even though the 101 figure seems outrageously
>> high
>>>> and I don't know why, still all these scans should be happening in
>>>> parallel, so the overall call should finish fast, given that no hardware
>>>> resource is contended, right? but this is not what's happening, so I
>> have
>>>> to be missing something(s).
>>>>>
>>>>> So, any help is appreciated there.
>>>>>
>>>>> Thanks,
>>>>> Khaled
>>>>
>>>>
>>
>>

Re: HBase read performance

Posted by Ted Yu <yu...@gmail.com>.
Khaled:
Were you using this method from HTable ?

  public Result[] get(List<Get> gets) throws IOException {

Cheers
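For reference, a minimal sketch of that batched multi-get. It only applies when the exact row keys are known up front, which is what the reverse-scan requirement in this thread rules out; the keys below are hypothetical:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "delta");
    List<Get> gets = new ArrayList<Get>();
    for (String key : new String[] { "k1", "k2", "k3" }) {  // ~1000 keys in practice
      gets.add(new Get(Bytes.toBytes(key)));
    }
    // The client batches these into one RPC per region server,
    // instead of one RPC (and one handler) per key.
    Result[] results = table.get(gets);
    System.out.println("fetched " + results.length + " rows");
    table.close();
  }
}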

On Thu, Oct 2, 2014 at 10:46 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:

> I've set the heap size to 6GB and I do have gc logging. No long pauses
> there -- occasional 0.1s or 0.2s.
>
> Other than the discrepancy between what's reported on the client and
> what's reported at the RS, there is also the issue of not getting proper
> concurrency. So, even if a reverse get takes 100ms or so (this has to be
> mostly blocking on various things as no physical resource is contended),
> then the other gets/scans should be able to proceed in parallel, so a
> thousand concurrent gets/scans should finish in few hundreds of ms not many
> seconds. That's why I thought I'd increase the handlers count to try to get
> more concurrency, but it didn't help. So, there must be something else.
>
> Khaled
>
> ----------------------------------------
> > From: ndimiduk@gmail.com
> > Date: Thu, 2 Oct 2014 10:36:39 -0700
> > Subject: Re: HBase read performance
> > To: user@hbase.apache.org
> >
> > Do check again on the heap size of the region servers. The default
> > unconfigured size is 1G; too small for much of anything. Check your RS
> logs
> > -- look for lines produced by the JVMPauseMonitor thread. They usually
> > correlate with long GC pauses or other process-freeze events.
> >
> > Get is implemented as a Scan of a single row, so a reverse scan of a
> single
> > row should be functionally equivalent.
> >
> > In practice, I have seen discrepancy between the latencies reported by
> the
> > RS and the latencies experienced by the client. I've not investigated
> this
> > area thoroughly.
> >
> > On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com>
> wrote:
> >
> >> Thanks Lars for your quick reply.
> >>
> >> Yes performance is similar with less handlers (I tried with 100 first).
> >>
> >> The payload is not big ~1KB or so. The working set doesn't seem to fit
> in
> >> memory as there are many cache misses. However, disk is far from being a
> >> bottleneck. I checked using iostat. I also verified that neither the
> >> network nor the CPU of the region server or the client are a bottleneck.
> >> This leads me to believe that likely this is a software bottleneck,
> >> possibly due to a misconfiguration on my side. I just don't know how to
> >> debug it. A clear disconnect I see is the individual request latency as
> >> reported by metrics on the region server (IPC processCallTime vs
> scanNext)
> >> vs what's measured on the client. Does this sound right? Any ideas on
> how
> >> to better debug it?
> >>
> >> About this trick with the timestamps to be able to do a forward scan,
> >> thanks for pointing it out. Actually, I am aware of it. The problem I
> have
> >> is, sometimes I want to get the key after a particular timestamp and
> >> sometimes I want to get the key before, so just relying on the key order
> >> doesn't work. Ideally, I want a reverse get(). I thought reverse scan
> can
> >> do the trick though.
> >>
> >> Khaled
> >>
> >> ----------------------------------------
> >>> Date: Thu, 2 Oct 2014 09:40:37 -0700
> >>> From: larsh@apache.org
> >>> Subject: Re: HBase read performance
> >>> To: user@hbase.apache.org
> >>>
> >>> Hi Khaled,
> >>> is it the same with fewer threads? 1500 handler threads seems to be a
> >> lot. Typically a good number of threads depends on the hardware (number
> of
> >> cores, number of spindles, etc). I cannot think of any type of scenario
> >> where more than 100 would give any improvement.
> >>>
> >>> How large is the payload per KV retrieved that way? If large (as in a
> >> few 100k) you definitely want to lower the number of the handler
> threads.
> >>> How much heap do you give the region server? Does the working set fit
> >> into the cache? (i.e. in the metrics, do you see the eviction count
> going
> >> up, if so it does not fit into the cache).
> >>>
> >>> If the working set does not fit into the cache (eviction count goes up)
> >> then HBase will need to bring a new block in from disk on each Get
> >> (assuming the Gets are more or less random as far as the server is
> >> concerned).
> >>> In case you'll benefit from reducing the HFile block size (from 64k to
> >> 8k or even 4k).
> >>>
> >>> Lastly I don't think we tested the performance of using reverse scan
> >> this way, there is probably room to optimize this.
> >>> Can you restructure your keys to allow forwards scanning? For example
> >> you could store the time as MAX_LONG-time. Or you could invert all the
> bits
> >> of the time portion of the key, so that it sort the other way. Then you
> >> could do a forward scan.
> >>>
> >>> Let us know how it goes.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: Khaled Elmeleegy <kd...@hotmail.com>
> >>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>> Cc:
> >>> Sent: Thursday, October 2, 2014 12:12 AM
> >>> Subject: HBase read performance
> >>>
> >>> Hi,
> >>>
> >>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
> >> client reading ~1000 keys from an HBase table. These keys happen to
> fall on
> >> the same region server. For my reads I use reverse scan to read each
> key as
> >> I want the key prior to a specific time stamp (time stamps are stored in
> >> reverse order). I don't believe gets can accomplish that, right? so I
> use
> >> scan, with caching set to 1.
> >>>
> >>> I use 2000 reader threads in the client and on HBase, I've set
> >> hbase.regionserver.handler.count to 1500. With this setup, my scatter
> >> gather is very slow and can take up to 10s in total. Timing an
> individual
> >> getScanner(..) call on the client side, it can easily take few hundreds
> of
> >> ms. I also got the following metrics from the region server in question:
> >>>
> >>> "queueCallTime_mean" : 2.190855525775637,
> >>> "queueCallTime_median" : 0.0,
> >>> "queueCallTime_75th_percentile" : 0.0,
> >>> "queueCallTime_95th_percentile" : 1.0,
> >>> "queueCallTime_99th_percentile" : 556.9799999999818,
> >>>
> >>> "processCallTime_min" : 0,
> >>> "processCallTime_max" : 12755,
> >>> "processCallTime_mean" : 105.64873440912682,
> >>> "processCallTime_median" : 0.0,
> >>> "processCallTime_75th_percentile" : 2.0,
> >>> "processCallTime_95th_percentile" : 7917.95,
> >>> "processCallTime_99th_percentile" : 8876.89,
> >>>
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
> >> : 89,
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
> >> : 11300,
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
> >> : 654.4949739797315,
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
> >> : 101.0,
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
> >> : 101.0,
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
> >> : 101.0,
> >>>
> >>
> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
> >> : 113.0,
> >>>
> >>> Where "delta" is the name of the table I am querying.
> >>>
> >>> In addition to all this, i monitored the hardware resources (CPU, disk,
> >> and network) of both the client and the region server and nothing seems
> >> anywhere near saturation. So I am puzzled by what's going on and where
> this
> >> time is going.
> >>>
> >>> Few things to note based on the above measurements: both medians of IPC
> >> processCallTime and queueCallTime are basically zero (ms I presume,
> >> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
> >> how this adds up. Also, even though the 101 figure seems outrageously
> high
> >> and I don't know why, still all these scans should be happening in
> >> parallel, so the overall call should finish fast, given that no hardware
> >> resource is contended, right? but this is not what's happening, so I
> have
> >> to be missing something(s).
> >>>
> >>> So, any help is appreciated there.
> >>>
> >>> Thanks,
> >>> Khaled
> >>
> >>
>
>

RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
I've set the heap size to 6GB and I do have gc logging. No long pauses there -- occasional 0.1s or 0.2s.

Other than the discrepancy between what's reported on the client and what's reported at the RS, there is also the issue of not getting proper concurrency. Even if a reverse get takes 100ms or so (this has to be mostly blocking on various things, as no physical resource is contended), the other gets/scans should be able to proceed in parallel, so a thousand concurrent gets/scans should finish in a few hundred ms, not many seconds. That's why I thought I'd increase the handler count to try to get more concurrency, but it didn't help. So there must be something else.

Khaled
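For reference, a minimal sketch of the client-side scatter/gather described above: a bounded pool issues one reversed Scan per probe key and the caller collects the results. The pool size, key list, and use of a shared HConnection (from HConnectionManager.createConnection(conf)) are assumptions, not the actual code from this thread:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScatterGather {
  // One reversed single-row scan per probe key, all issued in parallel.
  public static List<Result> fetch(final HConnection conn, List<byte[]> probeKeys)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(64); // far fewer than 2000 threads
    try {
      List<Future<Result>> futures = new ArrayList<Future<Result>>();
      for (final byte[] key : probeKeys) {
        futures.add(pool.submit(new Callable<Result>() {
          public Result call() throws Exception {
            HTableInterface table = conn.getTable("delta"); // lightweight per-task handle
            try {
              Scan scan = new Scan(key);
              scan.setReversed(true);
              scan.setCaching(1);
              ResultScanner scanner = table.getScanner(scan);
              try {
                return scanner.next(); // closest row at or before the probe key
              } finally {
                scanner.close();
              }
            } finally {
              table.close();
            }
          }
        }));
      }
      List<Result> results = new ArrayList<Result>();
      for (Future<Result> f : futures) {
        results.add(f.get()); // gather
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}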

----------------------------------------
> From: ndimiduk@gmail.com
> Date: Thu, 2 Oct 2014 10:36:39 -0700
> Subject: Re: HBase read performance
> To: user@hbase.apache.org
>
> Do check again on the heap size of the region servers. The default
> unconfigured size is 1G; too small for much of anything. Check your RS logs
> -- look for lines produced by the JVMPauseMonitor thread. They usually
> correlate with long GC pauses or other process-freeze events.
>
> Get is implemented as a Scan of a single row, so a reverse scan of a single
> row should be functionally equivalent.
>
> In practice, I have seen discrepancy between the latencies reported by the
> RS and the latencies experienced by the client. I've not investigated this
> area thoroughly.
>
> On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <kd...@hotmail.com> wrote:
>
>> Thanks Lars for your quick reply.
>>
>> Yes performance is similar with less handlers (I tried with 100 first).
>>
>> The payload is not big ~1KB or so. The working set doesn't seem to fit in
>> memory as there are many cache misses. However, disk is far from being a
>> bottleneck. I checked using iostat. I also verified that neither the
>> network nor the CPU of the region server or the client are a bottleneck.
>> This leads me to believe that likely this is a software bottleneck,
>> possibly due to a misconfiguration on my side. I just don't know how to
>> debug it. A clear disconnect I see is the individual request latency as
>> reported by metrics on the region server (IPC processCallTime vs scanNext)
>> vs what's measured on the client. Does this sound right? Any ideas on how
>> to better debug it?
>>
>> About this trick with the timestamps to be able to do a forward scan,
>> thanks for pointing it out. Actually, I am aware of it. The problem I have
>> is, sometimes I want to get the key after a particular timestamp and
>> sometimes I want to get the key before, so just relying on the key order
>> doesn't work. Ideally, I want a reverse get(). I thought reverse scan can
>> do the trick though.
>>
>> Khaled
>>
>> ----------------------------------------
>>> Date: Thu, 2 Oct 2014 09:40:37 -0700
>>> From: larsh@apache.org
>>> Subject: Re: HBase read performance
>>> To: user@hbase.apache.org
>>>
>>> Hi Khaled,
>>> is it the same with fewer threads? 1500 handler threads seems to be a
>> lot. Typically a good number of threads depends on the hardware (number of
>> cores, number of spindles, etc). I cannot think of any type of scenario
>> where more than 100 would give any improvement.
>>>
>>> How large is the payload per KV retrieved that way? If large (as in a
>> few 100k) you definitely want to lower the number of the handler threads.
>>> How much heap do you give the region server? Does the working set fit
>> into the cache? (i.e. in the metrics, do you see the eviction count going
>> up, if so it does not fit into the cache).
>>>
>>> If the working set does not fit into the cache (eviction count goes up)
>> then HBase will need to bring a new block in from disk on each Get
>> (assuming the Gets are more or less random as far as the server is
>> concerned).
>>> In case you'll benefit from reducing the HFile block size (from 64k to
>> 8k or even 4k).
>>>
>>> Lastly I don't think we tested the performance of using reverse scan
>> this way, there is probably room to optimize this.
>>> Can you restructure your keys to allow forwards scanning? For example
>> you could store the time as MAX_LONG-time. Or you could invert all the bits
>> of the time portion of the key, so that it sort the other way. Then you
>> could do a forward scan.
>>>
>>> Let us know how it goes.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Khaled Elmeleegy <kd...@hotmail.com>
>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>> Cc:
>>> Sent: Thursday, October 2, 2014 12:12 AM
>>> Subject: HBase read performance
>>>
>>> Hi,
>>>
>>> I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a
>> client reading ~1000 keys from an HBase table. These keys happen to fall on
>> the same region server. For my reads I use reverse scan to read each key as
>> I want the key prior to a specific time stamp (time stamps are stored in
>> reverse order). I don't believe gets can accomplish that, right? so I use
>> scan, with caching set to 1.
>>>
>>> I use 2000 reader threads in the client and on HBase, I've set
>> hbase.regionserver.handler.count to 1500. With this setup, my scatter
>> gather is very slow and can take up to 10s in total. Timing an individual
>> getScanner(..) call on the client side, it can easily take few hundreds of
>> ms. I also got the following metrics from the region server in question:
>>>
>>> "queueCallTime_mean" : 2.190855525775637,
>>> "queueCallTime_median" : 0.0,
>>> "queueCallTime_75th_percentile" : 0.0,
>>> "queueCallTime_95th_percentile" : 1.0,
>>> "queueCallTime_99th_percentile" : 556.9799999999818,
>>>
>>> "processCallTime_min" : 0,
>>> "processCallTime_max" : 12755,
>>> "processCallTime_mean" : 105.64873440912682,
>>> "processCallTime_median" : 0.0,
>>> "processCallTime_75th_percentile" : 2.0,
>>> "processCallTime_95th_percentile" : 7917.95,
>>> "processCallTime_99th_percentile" : 8876.89,
>>>
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min"
>> : 89,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max"
>> : 11300,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean"
>> : 654.4949739797315,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median"
>> : 101.0,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile"
>> : 101.0,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile"
>> : 101.0,
>>>
>> "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile"
>> : 113.0,
>>>
>>> Where "delta" is the name of the table I am querying.
>>>
>>> In addition to all this, i monitored the hardware resources (CPU, disk,
>> and network) of both the client and the region server and nothing seems
>> anywhere near saturation. So I am puzzled by what's going on and where this
>> time is going.
>>>
>>> Few things to note based on the above measurements: both medians of IPC
>> processCallTime and queueCallTime are basically zero (ms I presume,
>> right?). However, scanNext_median is 101 (ms too, right?). I am not sure
>> how this adds up. Also, even though the 101 figure seems outrageously high
>> and I don't know why, still all these scans should be happening in
>> parallel, so the overall call should finish fast, given that no hardware
>> resource is contended, right? but this is not what's happening, so I have
>> to be missing something(s).
>>>
>>> So, any help is appreciated there.
>>>
>>> Thanks,
>>> Khaled
>>
>>

Re: HBase read performance

Posted by Nick Dimiduk <nd...@gmail.com>.
Do check again on the heap size of the region servers. The default
unconfigured size is 1G; too small for much of anything. Check your RS logs
-- look for lines produced by the JVMPauseMonitor thread. They usually
correlate with long GC pauses or other process-freeze events.

Get is implemented as a Scan of a single row, so a reverse scan of a single
row should be functionally equivalent.

In practice, I have seen discrepancy between the latencies reported by the
RS and the latencies experienced by the client. I've not investigated this
area thoroughly.


RE: HBase read performance

Posted by Khaled Elmeleegy <kd...@hotmail.com>.
Thanks Lars for your quick reply.

Yes, performance is similar with fewer handlers (I tried 100 first).

The payload is not big, ~1KB or so. The working set doesn't seem to fit in memory, as there are many cache misses. However, disk is far from being a bottleneck; I checked using iostat. I also verified that neither the network nor the CPU of the region server or the client is a bottleneck. This leads me to believe that this is likely a software bottleneck, possibly due to a misconfiguration on my side. I just don't know how to debug it. A clear disconnect I see is between the individual request latency as reported by the metrics on the region server (IPC processCallTime vs. scanNext) and what's measured on the client. Does this sound right? Any ideas on how to better debug it?
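
(One way to narrow down where the time goes, as a rough sketch: time getScanner()
and the first next() separately on the client and line them up against the
server-side scanNext numbers. The helper below is illustrative only:)

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public final class ScanTimer {
  // Times the scanner setup and the first fetch separately (in ms) so that
  // client-side latency can be compared with the region server metrics.
  static Result timedSingleRowScan(HTableInterface table, Scan scan) throws Exception {
    long t0 = System.nanoTime();
    ResultScanner scanner = table.getScanner(scan);
    long t1 = System.nanoTime();
    try {
      Result r = scanner.next();
      long t2 = System.nanoTime();
      System.out.printf("getScanner: %d ms, first next(): %d ms%n",
          (t1 - t0) / 1000000, (t2 - t1) / 1000000);
      return r;
    } finally {
      scanner.close();
    }
  }
}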

About the trick with the timestamps to be able to do a forward scan, thanks for pointing it out. Actually, I am aware of it. The problem I have is that sometimes I want to get the key after a particular timestamp and sometimes I want the key before it, so just relying on the key order doesn't work. Ideally, I want a reverse get(). I thought a reverse scan could do the trick, though.
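
(To make the two cases concrete, a sketch under the assumption that the row key
is "<key>|<zero-padded timestamp>"; the layout and method name are stand-ins for
the real schema, and only the scan direction differs between "before" and
"after":)

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public final class BoundaryScan {
  // Returns a single-row scan for one logical key: reversed for "last entry at
  // or before the timestamp", forward for "first entry at or after it".
  // The timestamp is zero-padded so lexicographic order matches numeric order.
  static Scan boundaryScan(String key, long timestampMs, boolean before) {
    byte[] startRow = Bytes.toBytes(String.format("%s|%019d", key, timestampMs));
    Scan scan = new Scan(startRow);
    scan.setReversed(before);  // reversed scans walk down from startRow
    scan.setCaching(1);        // only one row is needed either way
    return scan;
  }
}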

Khaled


Re: HBase read performance

Posted by lars hofhansl <la...@apache.org>.
Also, you may be maxing out the network between the region server and your client, depending on the size of the KVs.

What's the bandwidth you see used on both ends?
Are you running clients on multiple machines?





Re: HBase read performance

Posted by lars hofhansl <la...@apache.org>.
Hi Khaled,
Is it the same with fewer threads? 1500 handler threads seems like a lot. Typically, a good number of threads depends on the hardware (number of cores, number of spindles, etc.). I cannot think of any scenario where more than 100 would give any improvement.

How large is the payload per KV retrieved that way? If large (as in a few hundred KB), you definitely want to lower the number of handler threads.
How much heap do you give the region server? Does the working set fit into the cache? (I.e., in the metrics, do you see the eviction count going up? If so, it does not fit into the cache.)

If the working set does not fit into the cache (eviction count goes up), then HBase will need to bring a new block in from disk on each Get (assuming the Gets are more or less random as far as the server is concerned).
In that case you'll benefit from reducing the HFile block size (from 64k to 8k or even 4k).
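
(A sketch of shrinking the block size on an existing family with the 0.98 admin
API; the family name "d" is a placeholder, and depending on your settings the
table may need to be disabled for the schema change:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class ShrinkBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Fetch the existing family descriptor so its other settings are preserved.
      HColumnDescriptor cf = admin.getTableDescriptor(TableName.valueOf("delta"))
          .getFamily(Bytes.toBytes("d"));   // "d" is a placeholder family name
      cf.setBlocksize(8 * 1024);            // 8k data blocks instead of the 64k default
      admin.modifyColumn("delta", cf);
      // Only newly written HFiles pick up the new size; existing files keep the
      // old block size until they are rewritten by compaction.
    } finally {
      admin.close();
    }
  }
}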

Lastly, I don't think we have tested the performance of using reverse scans this way; there is probably room to optimize this.
Can you restructure your keys to allow forward scanning? For example, you could store the time as MAX_LONG-time. Or you could invert all the bits of the time portion of the key, so that it sorts the other way. Then you could do a forward scan (see the sketch below).
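
(A rough sketch of the MAX_LONG-time idea; the "<entity>|<inverted timestamp>"
layout and the method name are made up for illustration:)

import org.apache.hadoop.hbase.util.Bytes;

public final class ReverseTimestampKey {
  // Row key of the form <entity>|<Long.MAX_VALUE - timestamp>. Big-endian longs
  // keep numeric order under HBase's lexicographic byte comparison (for
  // non-negative values), so newer entries sort first and a plain forward scan
  // from the <entity>| prefix returns them newest-to-oldest.
  static byte[] rowKey(String entity, long timestampMs) {
    byte[] prefix = Bytes.toBytes(entity + "|");
    byte[] invertedTs = Bytes.toBytes(Long.MAX_VALUE - timestampMs);
    return Bytes.add(prefix, invertedTs);
  }
}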

Let us know how it goes.

-- Lars

