Posted to user@hbase.apache.org by Jan Schellenberger <le...@gmail.com> on 2014/02/01 00:12:09 UTC

Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

I am running a cluster and getting slow read performance - about 50 reads/sec/node,
or about 800 reads/sec for the cluster.  The data is too big to fit into
memory and my access pattern is completely random reads, which is presumably
difficult for HBase.  Is my read speed reasonable?  The typical read speeds
I've seen reported seem much higher.



Hardware/Software Configuration:
17 nodes + 1 master
8 cores
24 gigs ram
4x1TB 3.5" hard drives (I know this is low for hbase - we're working on
getting more disks)
running Cloudera CDH 4.3 with HBase 0.94.6
Most configurations are default except I'm using a 12GB heap per region
server and the block cache fraction is .4 instead of .25, but neither of these
two things makes much of a difference.   I am NOT having a GC issue.  Latencies
are around 40ms and the 99th percentile is 200ms.


Dataset Description:
6 tables ~300GB each (uncompressed) or 120GB each compressed <- compression
speeds things up a bit.
I just ran a major compaction so block locality is 100%
Each table has a single column family and a single column ("c:d").
Keys are short strings, ~10-20 characters.
Values are short JSON, ~500 characters.
100% Gets.  No Puts.
I am making heavy use of timestamps/versions.  The table's VERSIONS is set to
Integer.MAX_VALUE.  My Gets retrieve at most 200 versions.  A typical row has
< 10 versions on average, though; <1% of queries would max out at 200 versions
returned.
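
To make the access pattern concrete, a typical query looks roughly like the
following in HBase shell terms (the row key and timestamp are made up, and the
real client is not the shell, so treat this as a sketch):

  get 'TABLE1', 'some-row-key', {COLUMN => 'c:d', VERSIONS => 200}
  get 'TABLE1', 'some-row-key', {COLUMN => 'c:d', VERSIONS => 200, TIMERANGE => [0, 1391126400000]}

The second form is the "as of some earlier time" variant - an upper bound on
the timestamp plus up to 200 versions.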

Here are table configurations (I've also tried Snappy compression)
{NAME => 'TABLE1', FAMILIES => [{NAME => 'c', DATA_BLOCK_ENCODING => 'NONE',
 BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '2147483647',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
 KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}


I am using the master node to query (with 20 threads) and get about 800
Gets/second.  Each worker node is completely swamped by disk I/O - I'm
seeing 80 IO/sec with iostat for each of the 4 disks, with a throughput of
about 10MB/sec each.  So this means it's reading roughly 120kB per transfer
and taking about 8 hard disk I/Os per Get request.  Does that seem
reasonable?  I've read the HFile spec, and my understanding is that if the
block index is loaded into memory, it should take 1 hard disk read to fetch
the proper block containing my row.


The region servers have a blockCacheHitRatio of about 33% (no compression)
or 50% (snappy compression)

Here are some regionserver stats while I'm running queries.  This is the
uncompressed table version and queries are only 38/sec
 
requestsPerSecond=38, numberOfOnlineRegions=212,
 numberOfStores=212, numberOfStorefiles=212, storefileIndexSizeMB=0,
rootIndexSizeKB=190, totalStaticIndexSizeKB=172689,
totalStaticBloomSizeKB=79692, memstoreSizeMB=0, mbInMemoryWithoutWAL=0,
numberOfPutsWithoutWAL=0, readRequestsCount=1865459,
writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0,
usedHeapMB=4565, maxHeapMB=12221, blockCacheSizeMB=4042.53,
blockCacheFreeMB=846.07, blockCacheCount=62176,
blockCacheHitCount=5389770, blockCacheMissCount=9909385,
blockCacheEvictedCount=2744919, blockCacheHitRatio=35%,
blockCacheHitCachingRatio=65%, hdfsBlocksLocalityIndex=99,
slowHLogAppendCount=0, fsReadLatencyHistogramMean=1570049.34,
fsReadLatencyHistogramCount=1239690.00,
fsReadLatencyHistogramMedian=20859045.50,
fsReadLatencyHistogram75th=35791318.75,
fsReadLatencyHistogram95th=97093132.05,
fsReadLatencyHistogram99th=179688655.93,
fsReadLatencyHistogram999th=312277183.40,
fsPreadLatencyHistogramMean=35548585.63,
fsPreadLatencyHistogramCount=2803268.00,
fsPreadLatencyHistogramMedian=37662144.00,
fsPreadLatencyHistogram75th=55991186.50,
fsPreadLatencyHistogram95th=116227275.50,
fsPreadLatencyHistogram99th=173173999.27,
fsPreadLatencyHistogram999th=273812341.79,
fsWriteLatencyHistogramMean=1523660.72,
fsWriteLatencyHistogramCount=1225000.00,
fsWriteLatencyHistogramMedian=226540.50,
fsWriteLatencyHistogram75th=380366.00,
fsWriteLatencyHistogram95th=2193516.80,
fsWriteLatencyHistogram99th=4290208.93,
fsWriteLatencyHistogram999th=6926850.53









--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Ted Yu <yu...@gmail.com>.
bq. DATA_BLOCK_ENCODING => 'NONE'

Have you tried enabling data block encoding with e.g. FAST_DIFF ?
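
Something like the following in the HBase shell should turn it on for an
existing table (untested here - the family name is taken from your descriptor;
depending on your settings you may need to disable/enable the table around the
alter, and a major compaction is needed to rewrite existing store files):

  alter 'TABLE1', {NAME => 'c', DATA_BLOCK_ENCODING => 'FAST_DIFF'}
  major_compact 'TABLE1'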

Cheers


On Fri, Jan 31, 2014 at 3:12 PM, Jan Schellenberger <le...@gmail.com>wrote:

> I am running a cluster and getting slow performance - about 50
> reads/sec/node
> or about 800 reads/sec for the cluster.  The data is too big to fit into
> memory and my access pattern is completely random reads which is presumably
> difficult for hbase.  Is my read speed reasonable?  I feel like typical
> read
> speeds I've seen reported are much higher?
>
>
>
> Hardware/Software Configuration:
> 17 nodes + 1 master
> 8 cores
> 24 gigs ram
> 4x1TB 3.5" hard drives (I know this is low for hbase - we're working on
> getting more disks)
> running Cloudera CDH 4.3 with hbase .94.6
> Most configurations are default except I'm using 12GB heap space/region
> server and the block cache is .4 instead of .25 but neither of these two
> things makes much of a difference.   I am NOT having a GC issue.  Latencies
> are around 40ms and 99% is 200ms.
>
>
> Dataset Description:
> 6 tables ~300GB each (uncompressed) or 120GB each compressed <- compression
> speeds things up a bit.
> I just ran a major compaction so block locality is 100%
> Each Table has a single column family and a single column ("c:d").
> keys are short strings ~10-20 characters.
> values are short json ~500 characters
> 100% Gets.  No Puts
> I am heavily using time stamping.  maxversions is set to Integer.MAXINT.
>  My
> gets have a maxretrieved of 200.  A typical row would have < 10 versions on
> average though.  <1% of queries would max out at 200 versions returned.
>
> Here are table configurations (I've also tried Snappy compression)
> {NAME => 'TABLE1', FAMILIES => [{NAME => 'c', DATA_BLOCK_ENCODING => 'NONE',
>  BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '2147483647',
>  COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
>  KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>  ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}
>
>
> I am using the master node to query (with 20 threads) and get about 800
> Gets/second.  Each worker node is completely swamped by disk i/o - I'm
> seeing 80 io/sec with iostat for each of the 4 disk with a throughput of
> about 10MB/sec each.  So this means it's reading roughly 120kB/transfer and
> it's taking about 8 Hard Disk I/O's per Get request.  Does that seem
> reasonable?  I've read the HFILE specs and I feel if the block index is
> loaded into memory, it should take 1 hard disk read to read the proper
> block
> with my row.
>
>
> The region servers have a blockCacheHitRatio of about 33% (no compression)
> or 50% (snappy compression)
>
> Here are some regionserver stats while I'm running queries.  This is the
> uncompressed table version and queries are only 38/sec
>
> requestsPerSecond=38, numberOfOnlineRegions=212,
>  numberOfStores=212, numberOfStorefiles=212, storefileIndexSizeMB=0,
> rootIndexSizeKB=190, totalStaticIndexSizeKB=172689,
> totalStaticBloomSizeKB=79692, memstoreSizeMB=0, mbInMemoryWithoutWAL=0,
> numberOfPutsWithoutWAL=0, readRequestsCount=1865459,
> writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0,
> usedHeapMB=4565, maxHeapMB=12221, blockCacheSizeMB=4042.53,
> blockCacheFreeMB=846.07, blockCacheCount=62176,
> blockCacheHitCount=5389770, blockCacheMissCount=9909385,
> blockCacheEvictedCount=2744919, blockCacheHitRatio=35%,
> blockCacheHitCachingRatio=65%, hdfsBlocksLocalityIndex=99,
> slowHLogAppendCount=0, fsReadLatencyHistogramMean=1570049.34,
> fsReadLatencyHistogramCount=1239690.00,
> fsReadLatencyHistogramMedian=20859045.50,
> fsReadLatencyHistogram75th=35791318.75,
> fsReadLatencyHistogram95th=97093132.05,
> fsReadLatencyHistogram99th=179688655.93,
> fsReadLatencyHistogram999th=312277183.40,
> fsPreadLatencyHistogramMean=35548585.63,
> fsPreadLatencyHistogramCount=2803268.00,
> fsPreadLatencyHistogramMedian=37662144.00,
> fsPreadLatencyHistogram75th=55991186.50,
> fsPreadLatencyHistogram95th=116227275.50,
> fsPreadLatencyHistogram99th=173173999.27,
> fsPreadLatencyHistogram999th=273812341.79,
> fsWriteLatencyHistogramMean=1523660.72,
> fsWriteLatencyHistogramCount=1225000.00,
> fsWriteLatencyHistogramMedian=226540.50,
> fsWriteLatencyHistogram75th=380366.00,
> fsWriteLatencyHistogram95th=2193516.80,
> fsWriteLatencyHistogram99th=4290208.93,
> fsWriteLatencyHistogram999th=6926850.53
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545.html
> Sent from the HBase User mailing list archive at Nabble.com.
>

RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
>> #3 I'm not sure I understand this suggestion - are you saying doing region
>> custom region splitting?  Each region is fully compacted so there is only
>> one HFile.  The queries I do are: "get me the most recent versions, up to
>> 200".  However I need to store more versions, because I may ask "get me the
>> most recent versions, up to 200 that I would have seen yesterday".

I am afraid your only option is SSD.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Jan Schellenberger [leipzig3@gmail.com]
Sent: Friday, January 31, 2014 6:38 PM
To: user@hbase.apache.org
Subject: RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Thank you.  I will have to test these things one at a time.

I re-enabled compression (SNAPPY for now) and changed the block encoding to
FAST_DIFF.

#1 I will try GZ encoding.
#2 The block cache size is already at .4. I will try to increase it a bit
more but I will never get the whole set into memory.
I will disable bloom filter.

#4 I will investigate this.  I thought I read somewhere that cloudera 4.3
has this shortcut enabled by default but I will try to verify.

#3 I'm not sure I understand this suggestion - are you saying doing region
custom region splitting?  Each region is fully compacted so there is only
one HFile.  The queries I do are: "get me the most recent versions, up to
200".  However I need to store more versions, because I may ask "get me the
most recent versions, up to 200 that I would have seen yesterday".


#5 HDFS short circuit reads are already enabled by default.
#6 yes SSD would clearly be better.

#7 The average result of the get is fairly small.  no more than 1kB I'd say.
We do hit each key with roughly the same probability.



I'm concerned about the block cache... It sounds like the wrong blocks
are being cached.  I thought there was a preference to cache index and bloom
blocks.

I'm currently running *60 queries/second* on one node and it's reporting
blockCacheHitRatio=29% and blockCacheHitCachingRatio=65% (not sure what the
difference is).

I also see rootIndexSize=122k totalStaticIndexSize=88MB and
totalstaticBloomSize=80MB (will disable bloomfilters in next run of this).
hdfslocality=100%





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055554.html
Sent from the HBase User mailing list archive at Nabble.com.

Confidentiality Notice:  The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited.  If you have received this message in error, please immediately notify the sender and/or Notifications@carrieriq.com and delete or destroy any copy of this message and its attachments.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Ted Yu <yu...@gmail.com>.
I realized that after hitting the Send button :-)

And 0.94.17 is around the corner, right ?


On Fri, Jan 31, 2014 at 9:27 PM, lars hofhansl <la...@apache.org> wrote:

> 0.94.16 is out already :)
>
>
>
> ----- Original Message -----
> From: Ted Yu <yu...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Friday, January 31, 2014 8:28 PM
> Subject: Re: Slow Get Performance (or how many disk I/O does it take for
> one non-cached read?)
>
> For #4,
> bq. has this shortcut enabled by default
>
> Inline checksum is different from short circuit read. Inline checksum is
> enabled by default in 0.96 and later releases - see HBASE-8322
>
> Meanwhile, you can consider upgrading to 0.94.15 - there have been quite
> some improvements since 0.94.6
>
> Cheers
>
>
>
> On Fri, Jan 31, 2014 at 6:38 PM, Jan Schellenberger <leipzig3@gmail.com
> >wrote:
>
> > Thank you.  I will have to test these things one at a time.
> >
> > I re-enabled compression (SNAPPY for now) and changed the block encoding
> to
> > FAST_DIFF.
> >
> > #1 I will try GZ encoding.
> > #2 The block cache size is already at .4. I will try to increase it a bit
> > more but I will never get the whole set into memory.
> > I will disable bloom filter.
> >
> > #4 I will investigate this.  I thought I read somewhere that cloudera 4.3
> > has this shortcut enabled by default but I will try to verify.
> >
> > #3 I'm not sure I understand this suggestion - are you saying doing
> region
> > custom region splitting?  Each region is fully compacted so there is only
> > one HFile.  The queries I do are: "get me the most recent versions, up to
> > 200".  However I need to store more versions, because I may ask "get me
> the
> > most recent versions, up to 200 that I would have seen yesterday".
> >
> >
> > #5 HDFS short circuit is already enabled already by default.
> > #6 yes SSD would clearly be better.
> >
> > #7 The average result of the get is fairly small.  no more than 1kB I'd
> > say.
> > We do hit each key with roughly the same probability.
> >
> >
> >
> > I'm concerned about the block cache... It sounds like the improper blocks
> > are being cached.  i thought there was a preference to cache index and
> > bloom
> > blocks.
> >
> > I'm currently* running 60 queries/second* one node and it's reading
> > blockCacheHitRatio=29 and blockCacheHitCachingRatio=65% (not sure what's
> > the
> > difference).
> >
> > I also see rootIndexSize=122k totalStaticIndexSize=88MB and
> > totalstaticBloomSize=80MB (will disable bloomfilters in next run of
> this).
> > hdfslocality=100%
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055554.html
> > Sent from the HBase User mailing list archive at Nabble.com.
> >
>
>

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by lars hofhansl <la...@apache.org>.
0.94.16 is out already :)



----- Original Message -----
From: Ted Yu <yu...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Friday, January 31, 2014 8:28 PM
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

For #4,
bq. has this shortcut enabled by default

Inline checksum is different from short circuit read. Inline checksum is
enabled by default in 0.96 and later releases - see HBASE-8322

Meanwhile, you can consider upgrading to 0.94.15 - there have been quite
some improvements since 0.94.6

Cheers



On Fri, Jan 31, 2014 at 6:38 PM, Jan Schellenberger <le...@gmail.com>wrote:

> Thank you.  I will have to test these things one at a time.
>
> I re-enabled compression (SNAPPY for now) and changed the block encoding to
> FAST_DIFF.
>
> #1 I will try GZ encoding.
> #2 The block cache size is already at .4. I will try to increase it a bit
> more but I will never get the whole set into memory.
> I will disable bloom filter.
>
> #4 I will investigate this.  I thought I read somewhere that cloudera 4.3
> has this shortcut enabled by default but I will try to verify.
>
> #3 I'm not sure I understand this suggestion - are you saying doing region
> custom region splitting?  Each region is fully compacted so there is only
> one HFile.  The queries I do are: "get me the most recent versions, up to
> 200".  However I need to store more versions, because I may ask "get me the
> most recent versions, up to 200 that I would have seen yesterday".
>
>
> #5 HDFS short circuit is already enabled already by default.
> #6 yes SSD would clearly be better.
>
> #7 The average result of the get is fairly small.  no more than 1kB I'd
> say.
> We do hit each key with roughly the same probability.
>
>
>
> I'm concerned about the block cache... It sounds like the improper blocks
> are being cached.  i thought there was a preference to cache index and
> bloom
> blocks.
>
> I'm currently* running 60 queries/second* one node and it's reading
> blockCacheHitRatio=29 and blockCacheHitCachingRatio=65% (not sure what's
> the
> difference).
>
> I also see rootIndexSize=122k totalStaticIndexSize=88MB and
> totalstaticBloomSize=80MB (will disable bloomfilters in next run of this).
> hdfslocality=100%
>
>
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055554.html
> Sent from the HBase User mailing list archive at Nabble.com.
>


Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Ted Yu <yu...@gmail.com>.
For #4,
bq. has this shortcut enabled by default

Inline checksum is different from short circuit read. Inline checksum is
enabled by default in 0.96 and later releases - see HBASE-8322

Meanwhile, you can consider upgrading to 0.94.15 - there have been quite a
few improvements since 0.94.6

Cheers


On Fri, Jan 31, 2014 at 6:38 PM, Jan Schellenberger <le...@gmail.com>wrote:

> Thank you.  I will have to test these things one at a time.
>
> I re-enabled compression (SNAPPY for now) and changed the block encoding to
> FAST_DIFF.
>
> #1 I will try GZ encoding.
> #2 The block cache size is already at .4. I will try to increase it a bit
> more but I will never get the whole set into memory.
> I will disable bloom filter.
>
> #4 I will investigate this.  I thought I read somewhere that cloudera 4.3
> has this shortcut enabled by default but I will try to verify.
>
> #3 I'm not sure I understand this suggestion - are you saying doing region
> custom region splitting?  Each region is fully compacted so there is only
> one HFile.  The queries I do are: "get me the most recent versions, up to
> 200".  However I need to store more versions, because I may ask "get me the
> most recent versions, up to 200 that I would have seen yesterday".
>
>
> #5 HDFS short circuit is already enabled already by default.
> #6 yes SSD would clearly be better.
>
> #7 The average result of the get is fairly small.  no more than 1kB I'd
> say.
> We do hit each key with roughly the same probability.
>
>
>
> I'm concerned about the block cache... It sounds like the improper blocks
> are being cached.  i thought there was a preference to cache index and
> bloom
> blocks.
>
> I'm currently* running 60 queries/second* one node and it's reading
> blockCacheHitRatio=29 and blockCacheHitCachingRatio=65% (not sure what's
> the
> difference).
>
> I also see rootIndexSize=122k totalStaticIndexSize=88MB and
> totalstaticBloomSize=80MB (will disable bloomfilters in next run of this).
> hdfslocality=100%
>
>
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055554.html
> Sent from the HBase User mailing list archive at Nabble.com.
>

RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Jan Schellenberger <le...@gmail.com>.
Thank you.  I will have to test these things one at a time.

I re-enabled compression (SNAPPY for now) and changed the block encoding to
FAST_DIFF.  

#1 I will try GZ encoding.
#2 The block cache size is already at .4. I will try to increase it a bit
more but I will never get the whole set into memory.  
I will disable bloom filter.  

#4 I will investigate this.  I thought I read somewhere that cloudera 4.3
has this shortcut enabled by default but I will try to verify.

#3 I'm not sure I understand this suggestion - are you saying doing region
custom region splitting?  Each region is fully compacted so there is only
one HFile.  The queries I do are: "get me the most recent versions, up to
200".  However I need to store more versions, because I may ask "get me the
most recent versions, up to 200 that I would have seen yesterday".


#5 HDFS short circuit reads are already enabled by default.
#6 yes SSD would clearly be better.

#7 The average result of a Get is fairly small - no more than 1kB, I'd say.
We do hit each key with roughly the same probability.



I'm concerned about the block cache... It sounds like the wrong blocks
are being cached.  I thought there was a preference to cache index and bloom
blocks.

I'm currently running *60 queries/second* on one node and it's reporting
blockCacheHitRatio=29% and blockCacheHitCachingRatio=65% (not sure what the
difference is).

I also see rootIndexSize=122k, totalStaticIndexSize=88MB and
totalStaticBloomSize=80MB (will disable bloom filters in the next run of this).
hdfslocality=100%
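
For reference, the schema change I applied was roughly this (from memory, so
treat it as a sketch rather than the exact commands):

  disable 'TABLE1'
  alter 'TABLE1', {NAME => 'c', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'}
  enable 'TABLE1'
  major_compact 'TABLE1'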





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055554.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Ted Yu <yu...@gmail.com>.
bq. #3. Custom compaction

Stripe compaction would be in the upcoming 0.98.0 release.
See HBASE-7667 Support stripe compaction

Cheers


On Fri, Jan 31, 2014 at 5:29 PM, Vladimir Rodionov
<vr...@carrieriq.com>wrote:

>
> #1 Use GZ compression instead of SNAPPY - usually it gives you additional
> 1.5 x
>
> Block Cache hit rate 50% is very low, actually and it is strange. On every
> GET there will be at least 3 accesses to block cache:
>
> get INDEX block, get BLOOM block, get DATA block. Therefore, everything
> below 66% is actually, - nothing.
>
> #2: Try to increase block cache size and see what will happen?
>
> Your Bloomfilter does not work actually, because you have zillion of
> versions. In this case, the only thing which can help you:
>
> major compaction of regions... or better -
>
> #3. Custom compaction, which will create non-overlapping, by timestamp,
> store files. Yes, its hard.
>
> #4 Disable CRC32 check in HDFS and enable inline CRC in HBase - this will
> save you 50% of IOPS.
> https://issues.apache.org/jira/browse/HBASE-5074
>
> #5 Enable short circuit reads (See HBase book on short circuit reads)
>
> #6 For your use case, probably,  the good idea to try SSDs.
>
> and finally,
>
> #7 the rule of thumb is to have your hot data set in RAM. Does not fit?
> Increase RAM, increase # of servers.
>
> btw, what is the average size of GET result and do you really touch every
> key in your data set with the same probability?
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Jan Schellenberger [leipzig3@gmail.com]
> Sent: Friday, January 31, 2014 3:12 PM
> To: user@hbase.apache.org
> Subject: Slow Get Performance (or how many disk I/O does it take for one
> non-cached read?)
>
> I am running a cluster and getting slow performance - about 50
> reads/sec/node
> or about 800 reads/sec for the cluster.  The data is too big to fit into
> memory and my access pattern is completely random reads which is presumably
> difficult for hbase.  Is my read speed reasonable?  I feel like typical
> read
> speeds I've seen reported are much higher?
>
>
>
> Hardware/Software Configuration:
> 17 nodes + 1 master
> 8 cores
> 24 gigs ram
> 4x1TB 3.5" hard drives (I know this is low for hbase - we're working on
> getting more disks)
> running Cloudera CDH 4.3 with hbase .94.6
> Most configurations are default except I'm using 12GB heap space/region
> server and the block cache is .4 instead of .25 but neither of these two
> things makes much of a difference.   I am NOT having a GC issue.  Latencies
> are around 40ms and 99% is 200ms.
>
>
> Dataset Description:
> 6 tables ~300GB each (uncompressed) or 120GB each compressed <- compression
> speeds things up a bit.
> I just ran a major compaction so block locality is 100%
> Each Table has a single column family and a single column ("c:d").
> keys are short strings ~10-20 characters.
> values are short json ~500 characters
> 100% Gets.  No Puts
> I am heavily using time stamping.  maxversions is set to Integer.MAXINT.
>  My
> gets have a maxretrieved of 200.  A typical row would have < 10 versions on
> average though.  <1% of queries would max out at 200 versions returned.
>
> Here are table configurations (I've also tried Snappy compression)
> {NAME => 'TABLE1', FAMILIES => [{NAME => 'c', DATA_BLOCK_ENCODING => 'NONE',
>  BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '2147483647',
>  COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
>  KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>  ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}
>
>
> I am using the master node to query (with 20 threads) and get about 800
> Gets/second.  Each worker node is completely swamped by disk i/o - I'm
> seeing 80 io/sec with iostat for each of the 4 disk with a throughput of
> about 10MB/sec each.  So this means it's reading roughly 120kB/transfer and
> it's taking about 8 Hard Disk I/O's per Get request.  Does that seem
> reasonable?  I've read the HFILE specs and I feel if the block index is
> loaded into memory, it should take 1 hard disk read to read the proper
> block
> with my row.
>
>
> The region servers have a blockCacheHitRatio of about 33% (no compression)
> or 50% (snappy compression)
>
> Here are some regionserver stats while I'm running queries.  This is the
> uncompressed table version and queries are only 38/sec
>
> requestsPerSecond=38, numberOfOnlineRegions=212,
>  numberOfStores=212, numberOfStorefiles=212, storefileIndexSizeMB=0,
> rootIndexSizeKB=190, totalStaticIndexSizeKB=172689,
> totalStaticBloomSizeKB=79692, memstoreSizeMB=0, mbInMemoryWithoutWAL=0,
> numberOfPutsWithoutWAL=0, readRequestsCount=1865459,
> writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0,
> usedHeapMB=4565, maxHeapMB=12221, blockCacheSizeMB=4042.53,
> blockCacheFreeMB=846.07, blockCacheCount=62176,
> blockCacheHitCount=5389770, blockCacheMissCount=9909385,
> blockCacheEvictedCount=2744919, blockCacheHitRatio=35%,
> blockCacheHitCachingRatio=65%, hdfsBlocksLocalityIndex=99,
> slowHLogAppendCount=0, fsReadLatencyHistogramMean=1570049.34,
> fsReadLatencyHistogramCount=1239690.00,
> fsReadLatencyHistogramMedian=20859045.50,
> fsReadLatencyHistogram75th=35791318.75,
> fsReadLatencyHistogram95th=97093132.05,
> fsReadLatencyHistogram99th=179688655.93,
> fsReadLatencyHistogram999th=312277183.40,
> fsPreadLatencyHistogramMean=35548585.63,
> fsPreadLatencyHistogramCount=2803268.00,
> fsPreadLatencyHistogramMedian=37662144.00,
> fsPreadLatencyHistogram75th=55991186.50,
> fsPreadLatencyHistogram95th=116227275.50,
> fsPreadLatencyHistogram99th=173173999.27,
> fsPreadLatencyHistogram999th=273812341.79,
> fsWriteLatencyHistogramMean=1523660.72,
> fsWriteLatencyHistogramCount=1225000.00,
> fsWriteLatencyHistogramMedian=226540.50,
> fsWriteLatencyHistogram75th=380366.00,
> fsWriteLatencyHistogram95th=2193516.80,
> fsWriteLatencyHistogram99th=4290208.93,
> fsWriteLatencyHistogram999th=6926850.53
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or Notifications@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>

RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
#1 Use GZ compression instead of SNAPPY - usually it gives you additional 1.5 x

A block cache hit rate of 50% is actually very low, and it is strange. On every GET there will be at least 3 accesses to the block cache:

get INDEX block, get BLOOM block, get DATA block. Therefore, anything below 66% effectively means no hits on data blocks at all.

#2: Try to increase block cache size and see what will happen?

Your Bloom filter does not actually help, because you have a zillion versions. In this case, the only thing which can help you is:

major compaction of regions... or better -

#3. Custom compaction, which will create store files that do not overlap by timestamp. Yes, it's hard.

#4 Disable CRC32 check in HDFS and enable inline CRC in HBase - this will save you 50% of IOPS.
https://issues.apache.org/jira/browse/HBASE-5074

#5 Enable short circuit reads (See HBase book on short circuit reads)

#6 For your use case, it is probably a good idea to try SSDs.

and finally,

#7 The rule of thumb is to have your hot data set in RAM. Does it not fit? Increase RAM, increase the # of servers.

By the way, what is the average size of a GET result, and do you really touch every key in your data set with the same probability?
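
For #1, the switch itself is a one-liner in the shell, followed by a major
compaction to rewrite the store files (sketch only - whether the extra CPU
cost of GZ is acceptable you will have to measure on your own workload):

  alter 'TABLE1', {NAME => 'c', COMPRESSION => 'GZ'}
  major_compact 'TABLE1'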

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Jan Schellenberger [leipzig3@gmail.com]
Sent: Friday, January 31, 2014 3:12 PM
To: user@hbase.apache.org
Subject: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

I am running a cluster and getting slow performance - about 50 reads/sec/node
or about 800 reads/sec for the cluster.  The data is too big to fit into
memory and my access pattern is completely random reads which is presumably
difficult for hbase.  Is my read speed reasonable?  I feel like typical read
speeds I've seen reported are much higher?



Hardware/Software Configuration:
17 nodes + 1 master
8 cores
24 gigs ram
4x1TB 3.5" hard drives (I know this is low for hbase - we're working on
getting more disks)
running Cloudera CDH 4.3 with hbase .94.6
Most configurations are default except I'm using 12GB heap space/region
server and the block cache is .4 instead of .25 but neither of these two
things makes much of a difference.   I am NOT having a GC issue.  Latencies
are around 40ms and 99% is 200ms.


Dataset Description:
6 tables ~300GB each (uncompressed) or 120GB each compressed <- compression
speeds things up a bit.
I just ran a major compaction so block locality is 100%
Each Table has a single column family and a single column ("c:d").
keys are short strings ~10-20 characters.
values are short json ~500 characters
100% Gets.  No Puts
I am heavily using time stamping.  maxversions is set to Integer.MAXINT.  My
gets have a maxretrieved of 200.  A typical row would have < 10 versions on
average though.  <1% of queries would max out at 200 versions returned.

Here are table configurations (I've also tried Snappy compression)
{NAME => 'TABLE1', FAMILIES => [{NAME => 'c', DATA_BLOCK_ENCODING => 'NONE',
 BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '2147483647',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
 KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}


I am using the master node to query (with 20 threads) and get about 800
Gets/second.  Each worker node is completely swamped by disk i/o - I'm
seeing 80 io/sec with iostat for each of the 4 disk with a throughput of
about 10MB/sec each.  So this means it's reading roughly 120kB/transfer and
it's taking about 8 Hard Disk I/O's per Get request.  Does that seem
reasonable?  I've read the HFILE specs and I feel if the block index is
loaded into memory, it should take 1 hard disk read to read the proper block
with my row.


The region servers have a blockCacheHitRatio of about 33% (no compression)
or 50% (snappy compression)

Here are some regionserver stats while I'm running queries.  This is the
uncompressed table version and queries are only 38/sec

requestsPerSecond=38, numberOfOnlineRegions=212,
 numberOfStores=212, numberOfStorefiles=212, storefileIndexSizeMB=0,
rootIndexSizeKB=190, totalStaticIndexSizeKB=172689,
totalStaticBloomSizeKB=79692, memstoreSizeMB=0, mbInMemoryWithoutWAL=0,
numberOfPutsWithoutWAL=0, readRequestsCount=1865459,
writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0,
usedHeapMB=4565, maxHeapMB=12221, blockCacheSizeMB=4042.53,
blockCacheFreeMB=846.07, blockCacheCount=62176,
blockCacheHitCount=5389770, blockCacheMissCount=9909385,
blockCacheEvictedCount=2744919, blockCacheHitRatio=35%,
blockCacheHitCachingRatio=65%, hdfsBlocksLocalityIndex=99,
slowHLogAppendCount=0, fsReadLatencyHistogramMean=1570049.34,
fsReadLatencyHistogramCount=1239690.00,
fsReadLatencyHistogramMedian=20859045.50,
fsReadLatencyHistogram75th=35791318.75,
fsReadLatencyHistogram95th=97093132.05,
fsReadLatencyHistogram99th=179688655.93,
fsReadLatencyHistogram999th=312277183.40,
fsPreadLatencyHistogramMean=35548585.63,
fsPreadLatencyHistogramCount=2803268.00,
fsPreadLatencyHistogramMedian=37662144.00,
fsPreadLatencyHistogram75th=55991186.50,
fsPreadLatencyHistogram95th=116227275.50,
fsPreadLatencyHistogram99th=173173999.27,
fsPreadLatencyHistogram999th=273812341.79,
fsWriteLatencyHistogramMean=1523660.72,
fsWriteLatencyHistogramCount=1225000.00,
fsWriteLatencyHistogramMedian=226540.50,
fsWriteLatencyHistogram75th=380366.00,
fsWriteLatencyHistogram95th=2193516.80,
fsWriteLatencyHistogram99th=4290208.93,
fsWriteLatencyHistogram999th=6926850.53









--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545.html
Sent from the HBase User mailing list archive at Nabble.com.

Confidentiality Notice:  The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited.  If you have received this message in error, please immediately notify the sender and/or Notifications@carrieriq.com and delete or destroy any copy of this message and its attachments.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by lars hofhansl <la...@apache.org>.
You cannot disable the cache for the index blocks. The index blocks are always cached.


Well, maybe you can by setting the block cache size to 0 - I would have to try :)



________________________________
 From: Vladimir Rodionov <vr...@carrieriq.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Friday, January 31, 2014 10:05 PM
Subject: RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)
 

Lars wrote:
>> You can also try disabling the block cache, as it does not help in your scenario anyway.

It helps with caching INDEX blocks - or do you suggest relying on the OS page cache?  BLOOM blocks are useless here, I think, therefore the Bloom filter can be disabled.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________

From: lars hofhansl [larsh@apache.org]
Sent: Friday, January 31, 2014 9:25 PM
To: user@hbase.apache.org
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

If your data does not fit into the cache and your request pattern is essentially random, then each GET will likely cause an entirely new HFile block to be read from disk (since that block was likely evicted due to other random GETs).

This is somewhat of a worst case for HBase. The default block size is 64k.
That is why the cache hit ratio is low and your disk IO is high. For each GET, even when reading just a single KV of a few hundred bytes, HBase needs to bring in 64k worth of data from disk.


With your load you can set the block size as low as 4k (or even lower).
That way HBase would still need to bring in a new block for each GET, but that block will only be 4k.
You can also try disabling the block cache, as it does not help in your scenario anyway.


Note that I mean the HFile block size, not the HDFS block (which is typically 64, 128, or 256 mb).


You can set this via the HBase shell as a column family parameter: BLOCKSIZE => '4096'.
I'd start with 4k and then vary up and down and do some testing.
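
Concretely, something like this (an untested sketch - pick whatever size your
testing favors, and remember the major compaction so existing files are
rewritten with the new block size):

  alter 'TABLE1', {NAME => 'c', BLOCKSIZE => '4096'}
  major_compact 'TABLE1'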

Truly random reads are very hard for any caching system.
Is your load really truly random, or is it just for testing?

-- Lars


----- Original Message -----
From: Jan Schellenberger <le...@gmail.com>
To: user@hbase.apache.org
Cc:
Sent: Friday, January 31, 2014 3:12 PM
Subject: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

I am running a cluster and getting slow performance - about 50 reads/sec/node
or about 800 reads/sec for the cluster.  The data is too big to fit into
memory and my access pattern is completely random reads which is presumably
difficult for hbase.  Is my read speed reasonable?  I feel like typical read
speeds I've seen reported are much higher?



Hardware/Software Configuration:
17 nodes + 1 master
8 cores
24 gigs ram
4x1TB 3.5" hard drives (I know this is low for hbase - we're working on
getting more disks)
running Cloudera CDH 4.3 with hbase .94.6
Most configurations are default except I'm using 12GB heap space/region
server and the block cache is .4 instead of .25 but neither of these two
things makes much of a difference.   I am NOT having a GC issue.  Latencies
are around 40ms and 99% is 200ms.


Dataset Description:
6 tables ~300GB each (uncompressed) or 120GB each compressed <- compression
speeds things up a bit.
I just ran a major compaction so block locality is 100%
Each Table has a single column family and a single column ("c:d").
keys are short strings ~10-20 characters.
values are short json ~500 characters
100% Gets.  No Puts
I am heavily using time stamping.  maxversions is set to Integer.MAXINT.  My
gets have a maxretrieved of 200.  A typical row would have < 10 versions on
average though.  <1% of queries would max out at 200 versions returned.

Here are table configurations (I've also tried Snappy compression)
{NAME => 'TABLE1', FAMILIES => [{NAME => 'c', DATA_BLOCK_ENCODING => 'NONE',
 BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '2147483647',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
 KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}]}


I am using the master node to query (with 20 threads) and get about 800
Gets/second.  Each worker node is completely swamped by disk i/o - I'm
seeing 80 io/sec with iostat for each of the 4 disk with a throughput of
about 10MB/sec each.  So this means it's reading roughly 120kB/transfer and
it's taking about 8 Hard Disk I/O's per Get request.  Does that seem
reasonable?  I've read the HFILE specs and I feel if the block index is
loaded into memory, it should take 1 hard disk read to read the proper block
with my row.


The region servers have a blockCacheHitRatio of about 33% (no compression)
or 50% (snappy compression)

Here are some regionserver stats while I'm running queries.  This is the
uncompressed table version and queries are only 38/sec

requestsPerSecond=38, numberOfOnlineRegions=212,
numberOfStores=212, numberOfStorefiles=212, storefileIndexSizeMB=0,
rootIndexSizeKB=190, totalStaticIndexSizeKB=172689,
totalStaticBloomSizeKB=79692, memstoreSizeMB=0, mbInMemoryWithoutWAL=0,
numberOfPutsWithoutWAL=0, readRequestsCount=1865459,
writeRequestsCount=0, compactionQueueSize=0, flushQueueSize=0,
usedHeapMB=4565, maxHeapMB=12221, blockCacheSizeMB=4042.53,
blockCacheFreeMB=846.07, blockCacheCount=62176,
blockCacheHitCount=5389770, blockCacheMissCount=9909385,
blockCacheEvictedCount=2744919, blockCacheHitRatio=35%,
blockCacheHitCachingRatio=65%, hdfsBlocksLocalityIndex=99,
slowHLogAppendCount=0, fsReadLatencyHistogramMean=1570049.34,
fsReadLatencyHistogramCount=1239690.00,
fsReadLatencyHistogramMedian=20859045.50,
fsReadLatencyHistogram75th=35791318.75,
fsReadLatencyHistogram95th=97093132.05,
fsReadLatencyHistogram99th=179688655.93,
fsReadLatencyHistogram999th=312277183.40,
fsPreadLatencyHistogramMean=35548585.63,
fsPreadLatencyHistogramCount=2803268.00,
fsPreadLatencyHistogramMedian=37662144.00,
fsPreadLatencyHistogram75th=55991186.50,
fsPreadLatencyHistogram95th=116227275.50,
fsPreadLatencyHistogram99th=173173999.27,
fsPreadLatencyHistogram999th=273812341.79,
fsWriteLatencyHistogramMean=1523660.72,
fsWriteLatencyHistogramCount=1225000.00,
fsWriteLatencyHistogramMedian=226540.50,
fsWriteLatencyHistogram75th=380366.00,
fsWriteLatencyHistogram95th=2193516.80,
fsWriteLatencyHistogram99th=4290208.93,
fsWriteLatencyHistogram999th=6926850.53









--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545.html
Sent from the HBase User mailing list archive at Nabble.com.


Confidentiality Notice:  The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited.  If you have received this message in error, please immediately notify the sender and/or Notifications@carrieriq.com and delete or destroy any copy of this message and its attachments.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Varun Sharma <va...@pinterest.com>.
Actually, there are 2 read-aheads in Linux (from what I learned the last time I
did benchmarking on random reads). One is the filesystem readahead which
Linux does, and then there is also a disk-level read-ahead which can be
modified using the hdparm command. IIRC, there is no sure way of
removing filesystem-level readaheads, but disk-level read-aheads can be
modified.

For regular HDDs, if you are reading small keys, I don't think changing the
disk-level read-ahead will make a big difference, because the seek latency
will dominate. However, for SSDs, it did help to reduce the disk-level read
ahead in my tests - because the seek latency is small and reading a lot of
unnecessary data is relatively expensive.




On Sat, Feb 1, 2014 at 11:07 PM, Vladimir Rodionov
<vr...@carrieriq.com>wrote:

> Block size does not matter on Linux . 256KB on read prefetch (read ahead).
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Saturday, February 01, 2014 10:33 PM
> To: user@hbase.apache.org
> Subject: Re: Slow Get Performance (or how many disk I/O does it take for
> one non-cached read?)
>
> Hmm... Interesting. I expected there to be a better improvement from
> smaller blocks.
>
> So it's really just IOPS (and block size does not matter), in which case,
> yes, HBase checksum will save you 50% IOPS for each data block (and since
> index blocks are cache) it'll save 50% total IOPS.
>
>
>
> ________________________________
>  From: Jan Schellenberger <le...@gmail.com>
> To: user@hbase.apache.org
> Sent: Saturday, February 1, 2014 9:38 PM
> Subject: Re: Slow Get Performance (or how many disk I/O does it take for
> one non-cached read?)
>
>
> I've experimented with the block size.  Here are results:
> 4k - 60reads/sec  - 1.2GB totalStaticIndexSize
> 8k - 80reads/sec  - 660MB totalStaticIndexSize
> 16k - 90reads/sec  - 330MB totalStaticIndexSize
> and previously
> 64k - 80reads/sec - ~100mb totalStaticIndexSize
>
>
> Also, I turned off caching and you are correct, the index blocks seem to be
> cached always - the blockCachedSize grows until it reaches
> totalStaticIndexSize and then stops growing.  If you turn caching on, it
> will grow until the maxHeap * blockCacheSize (.4 in my case).
>
> I saw no performance difference between caching off/on so I guess off is
> fine.
>
> Yes, I always do a major_compact before testing.
>
> I think this probably concludes my question - I will try to upgrade to a
> newer hbase version to get the CRC32/HDFS check fix and we will probably
> have to evaluate SSD's.
>
> Cheers, everyone.
>
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055582.html
>
> Sent from the HBase User mailing list archive at Nabble.com.
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or Notifications@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>

RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
Block size does not matter on Linux - it prefetches 256KB on read (read-ahead).

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: lars hofhansl [larsh@apache.org]
Sent: Saturday, February 01, 2014 10:33 PM
To: user@hbase.apache.org
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Hmm... Interesting. I expected there to be a better improvement from smaller blocks.

So it's really just IOPS (and block size does not matter), in which case, yes, HBase checksums will save you 50% of the IOPS for each data block (and since index blocks are cached) it'll save 50% of total IOPS.



________________________________
 From: Jan Schellenberger <le...@gmail.com>
To: user@hbase.apache.org
Sent: Saturday, February 1, 2014 9:38 PM
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)


I've experimented with the block size.  Here are results:
4k - 60reads/sec  - 1.2GB totalStaticIndexSize
8k - 80reads/sec  - 660MB totalStaticIndexSize
16k - 90reads/sec  - 330MB totalStaticIndexSize
and previously
64k - 80reads/sec - ~100mb totalStaticIndexSize


Also, I turned off caching and you are correct, the index blocks seem to be
cached always - the blockCachedSize grows until it reaches
totalStaticIndexSize and then stops growing.  If you turn caching on, it
will grow until the maxHeap * blockCacheSize (.4 in my case).

I saw no performance difference between caching off/on so I guess off is
fine.

Yes, I always do a major_compact before testing.

I think this probably concludes my question - I will try to upgrade to a
newer hbase version to get the CRC32/HDFS check fix and we will probably
have to evaluate SSD's.

Cheers, everyone.




--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055582.html

Sent from the HBase User mailing list archive at Nabble.com.

Confidentiality Notice:  The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited.  If you have received this message in error, please immediately notify the sender and/or Notifications@carrieriq.com and delete or destroy any copy of this message and its attachments.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by lars hofhansl <la...@apache.org>.
Hmm... Interesting. I expected there to be a better improvement from smaller blocks.

So it's really just IOPS (and block size does not matter), in which case, yes, HBase checksums will save you 50% of the IOPS for each data block (and since index blocks are cached) it'll save 50% of total IOPS.



________________________________
 From: Jan Schellenberger <le...@gmail.com>
To: user@hbase.apache.org 
Sent: Saturday, February 1, 2014 9:38 PM
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)
 

I've experimented with the block size.  Here are results:
4k - 60reads/sec  - 1.2GB totalStaticIndexSize
8k - 80reads/sec  - 660MB totalStaticIndexSize
16k - 90reads/sec  - 330MB totalStaticIndexSize
and previously
64k - 80reads/sec - ~100mb totalStaticIndexSize


Also, I turned off caching and you are correct, the index blocks seem to be
cached always - the blockCachedSize grows until it reaches
totalStaticIndexSize and then stops growing.  If you turn caching on, it
will grow until the maxHeap * blockCacheSize (.4 in my case).

I saw no performance difference between caching off/on so I guess off is
fine.

Yes, I always do a major_compact before testing.

I think this probably concludes my question - I will try to upgrade to a
newer hbase version to get the CRC32/HDFS check fix and we will probably
have to evaluate SSD's.  

Cheers, everyone.




--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055582.html

Sent from the HBase User mailing list archive at Nabble.com.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Jan Schellenberger <le...@gmail.com>.
I've experimented with the block size.  Here are results:
4k  - 60 reads/sec - 1.2GB totalStaticIndexSize
8k  - 80 reads/sec - 660MB totalStaticIndexSize
16k - 90 reads/sec - 330MB totalStaticIndexSize
and previously
64k - 80 reads/sec - ~100MB totalStaticIndexSize


Also, I turned off caching and you are correct, the index blocks seem to be
cached always - the blockCachedSize grows until it reaches
totalStaticIndexSize and then stops growing.  If you turn caching on, it
will grow until the maxHeap * blockCacheSize (.4 in my case).

I saw no performance difference between caching off/on so I guess off is
fine.

Yes, I always do a major_compact before testing.

I think this probably concludes my question - I will try to upgrade to a
newer HBase version to get the CRC32/HDFS checksum fix, and we will probably
have to evaluate SSDs.

Cheers, everyone.




--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055582.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Andrew Purtell <an...@gmail.com>.
To clarify what Lars said: We can do custom encoding of the key values in HFile blocks (FAST_DIFF, etc) in cache as well as on disk. We can also or instead do whole block compression using the usual suspects (gzip, snappy), but only as part of reading or writing HFile blocks "at the HDFS level".

> On Feb 1, 2014, at 8:10 PM, Jay Vyas <ja...@gmail.com> wrote:
> 
> RE: HDFS Compression... that is interesting -- i didnt think HBase  forced
> any HDFS specific operatoins (other than short circuit reads, which is
> configurable on/off)?
> 
> ... So how is the compression encoding implemented, and how do other file
> systems handle it?  I dont think compression is specifically part of the
> FileSystem API.
> 
> 
>> On Sat, Feb 1, 2014 at 11:06 PM, lars hofhansl <la...@apache.org> wrote:
>> 
>> HBase always loads the whole block and then seeks forward in that block
>> until it finds the KV it  is looking for (there is no indexing inside the
>> block).
>> 
>> Also note that HBase has compression and block encoding. These are
>> different. Compression compresses the files on disk (at the HDFS level) and
>> not in memory, so it does not help with your cache size. Encoding is
>> applied at the HBase block level and is retained in the block cache.
>> 
>> I'm really curious as what kind of improvement you see with smaller block
>> size. Remember that after you change BLOCKSIZE you need to issue a major
>> compaction so that the data is rewritten into smaller blocks.
>> 
>> We should really document this stuff better.
>> 
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Jan Schellenberger <le...@gmail.com>
>> To: user@hbase.apache.org
>> Sent: Friday, January 31, 2014 10:31 PM
>> Subject: RE: Slow Get Performance (or how many disk I/O does it take for
>> one non-cached read?)
>> 
>> 
>> A lot of useful information here...
>> 
>> I disabled bloom filters
>> I changed to gz compression (compressed files significantly)
>> 
>> I'm now seeing about *80gets/sec/server* which is a pretty good
>> improvement.
>> Since I estimate that the server is capable of about 300-350 hard disk
>> operations/second, that's about 4 hard disk operations/get.
>> 
>> I will experiment with the BLOCKSIZE next.  Unfortunately upgrading our
>> system to a newer HBASE/Hadoop is tricky for various IT/regulation reasons
>> but I'll ask to upgrade.  From what I see, even Cloudera 4.5.0 still comes
>> with HBase 94.6
>> 
>> 
>> 
>> 
>> I also restarted the regionservers and am now getting
>> blockCacheHitCachingRatio=51% and blockCacheHitRatio=51%.
>> So conceivably, I could be hitting the:
>> root index (cache hit)
>> block index (cache hit)
>> load on average 2 blocks to get data (cache misses most likely as my total
>> heap space is 1/7 the compressed dataset)
>> That would be about 52% cache hit overall and if each data access requires
>> 2
>> Hard Drive reads (data + checksum) then that would explain my throughput.
>> It still seems high but probably within the realm of reason.
>> 
>> Does HBase always read a full block (the 64k HFile block, not the HDFS
>> block) at a time or can it just jump to a particular location within the
>> block?
>> 
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055564.html
>> 
>> Sent from the HBase User mailing list archive at Nabble.com.
> 
> 
> 
> -- 
> Jay Vyas
> http://jayunit100.blogspot.com

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Jay Vyas <ja...@gmail.com>.
RE: HDFS Compression... that is interesting -- I didn't think HBase forced
any HDFS-specific operations (other than short-circuit reads, which is
configurable on/off)?

... So how is the compression encoding implemented, and how do other file
systems handle it?  I don't think compression is specifically part of the
FileSystem API.


On Sat, Feb 1, 2014 at 11:06 PM, lars hofhansl <la...@apache.org> wrote:

> HBase always loads the whole block and then seeks forward in that block
> until it finds the KV it  is looking for (there is no indexing inside the
> block).
>
> Also note that HBase has compression and block encoding. These are
> different. Compression compresses the files on disk (at the HDFS level) and
> not in memory, so it does not help with your cache size. Encoding is
> applied at the HBase block level and is retained in the block cache.
>
> I'm really curious as to what kind of improvement you see with smaller block
> size. Remember that after you change BLOCKSIZE you need to issue a major
> compaction so that the data is rewritten into smaller blocks.
>
> We should really document this stuff better.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Jan Schellenberger <le...@gmail.com>
> To: user@hbase.apache.org
> Sent: Friday, January 31, 2014 10:31 PM
> Subject: RE: Slow Get Performance (or how many disk I/O does it take for
> one non-cached read?)
>
>
> A lot of useful information here...
>
> I disabled bloom filters
> I changed to gz compression (compressed files significantly)
>
> I'm now seeing about *80 gets/sec/server*, which is a pretty good
> improvement.
> Since I estimate that the server is capable of about 300-350 hard disk
> operations/second, that's about 4 hard disk operations/get.
>
> I will experiment with the BLOCKSIZE next.  Unfortunately upgrading our
> system to a newer HBASE/Hadoop is tricky for various IT/regulation reasons
> but I'll ask to upgrade.  From what I see, even Cloudera 4.5.0 still comes
> with HBase 94.6
>
>
>
>
> I also restarted the regionservers and am now getting
> blockCacheHitCachingRatio=51% and blockCacheHitRatio=51%.
> So conceivably, I could be hitting the:
> root index (cache hit)
> block index (cache hit)
> load on average 2 blocks to get data (cache misses most likely as my total
> heap space is 1/7 the compressed dataset)
> That would be about 52% cache hit overall and if each data access requires
> 2
> Hard Drive reads (data + checksum) then that would explain my throughput.
> It still seems high but probably within the realm of reason.
>
> Does HBase always read a full block (the 64k HFile block, not the HDFS
> block) at a time or can it just jump to a particular location within the
> block?
>
>
>
>
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055564.html
>
> Sent from the HBase User mailing list archive at Nabble.com.
>



-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by lars hofhansl <la...@apache.org>.
HBase always loads the whole block and then seeks forward in that block until it finds the KV it  is looking for (there is no indexing inside the block).

Also note that HBase has compression and block encoding. These are different. Compression compresses the files on disk (at the HDFS level) and not in memory, so it does not help with your cache size. Encoding is applied at the HBase block level and is retained in the block cache.

I'm really curious as to what kind of improvement you see with smaller block size. Remember that after you change BLOCKSIZE you need to issue a major compaction so that the data is rewritten into smaller blocks.
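
A minimal shell sketch of that sequence (same table/family names as the descriptor earlier in the thread; the disable/enable pair may be unnecessary if online schema updates are enabled):

  disable 'TABLE1'
  alter 'TABLE1', {NAME => 'c', BLOCKSIZE => '4096'}
  enable 'TABLE1'
  # rewrite the store files so they are actually laid out in 4k HFile blocks
  major_compact 'TABLE1'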

We should really document this stuff better.


-- Lars



________________________________
 From: Jan Schellenberger <le...@gmail.com>
To: user@hbase.apache.org 
Sent: Friday, January 31, 2014 10:31 PM
Subject: RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)
 

A lot of useful information here...

I disabled bloom filters
I changed to gz compression (compressed files significantly)

I'm now seeing about *80 gets/sec/server*, which is a pretty good improvement.
Since I estimate that the server is capable of about 300-350 hard disk
operations/second, that's about 4 hard disk operations/get.

I will experiment with the BLOCKSIZE next.  Unfortunately upgrading our
system to a newer HBASE/Hadoop is tricky for various IT/regulation reasons
but I'll ask to upgrade.  From what I see, even Cloudera 4.5.0 still comes
with HBase 94.6




I also restarted the regionservers and am now getting
blockCacheHitCachingRatio=51% and blockCacheHitRatio=51%.  
So conceivably, I could be hitting the: 
root index (cache hit)
block index (cache hit)
load on average 2 blocks to get data (cache misses most likely as my total
heap space is 1/7 the compressed dataset)
That would be about 52% cache hit overall and if each data access requires 2
Hard Drive reads (data + checksum) then that would explain my throughput.
It still seems high but probably within the realm of reason.

Does HBase always read a full block (the 64k HFile block, not the HDFS
block) at a time or can it just jump to a particular location within the
block?





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055564.html

Sent from the HBase User mailing list archive at Nabble.com.

RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Jan Schellenberger <le...@gmail.com>.
A lot of useful information here...

I disabled bloom filters
I changed to gz compression (compressed files significantly)
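
(Roughly, via the shell, with one of the six tables shown as an example and a major compaction afterwards so the existing store files get rewritten:)

  alter 'TABLE1', {NAME => 'c', BLOOMFILTER => 'NONE', COMPRESSION => 'GZ'}
  major_compact 'TABLE1'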

I'm now seeing about *80 gets/sec/server*, which is a pretty good improvement.
Since I estimate that the server is capable of about 300-350 hard disk
operations/second, that's about 4 hard disk operations/get.

I will experiment with the BLOCKSIZE next.  Unfortunately upgrading our
system to a newer HBASE/Hadoop is tricky for various IT/regulation reasons
but I'll ask to upgrade.  From what I see, even Cloudera 4.5.0 still comes
with HBase 94.6




I also restarted the regionservers and am now getting
blockCacheHitCachingRatio=51% and blockCacheHitRatio=51%.  
So conceivably, I could be hitting the: 
root index (cache hit)
block index (cache hit)
load on average 2 blocks to get data (cache misses most likely as my total
heap space is 1/7 the compressed dataset)
That would be about 52% cache hit overall and if each data access requires 2
Hard Drive reads (data + checksum) then that would explain my throughput.
It still seems high but probably within the realm of reason.
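
Spelled out, the rough arithmetic (all numbers from above):

  ~2 index blocks/Get (cached) + ~2 data blocks/Get (not cached)
      -> cache hit ratio ~ 2 hits / 4 block reads = 50%          (observed: 51%)
  ~2 data blocks/Get x 2 disk reads each (block + checksum) = ~4 disk ops/Get
      -> 300-350 IOPS per server / 4 = ~75-88 Gets/sec           (observed: ~80)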

Does HBase always read a full block (the 64k HFile block, not the HDFS
block) at a time or can it just jump to a particular location within the
block?





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Slow-Get-Performance-or-how-many-disk-I-O-does-it-take-for-one-non-cached-read-tp4055545p4055564.html
Sent from the HBase User mailing list archive at Nabble.com.

RE: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
Lars wrote:
>> You can also try disabling the block cache, as it does not help in your scenario anyway.

It helps with caching INDEX blocks, does it not? Or do you suggest relying on the OS page cache?  BLOOM blocks are useless here, I think, so the Bloom filter can be disabled.
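
If one did want to stop caching data blocks for this table while leaving everything else alone, it would presumably be the per-family switch below; whether leaf index blocks still get cached with it turned off is exactly the question above, so worth testing first:

  # stop caching data blocks for family 'c' (example table name from the thread)
  alter 'TABLE1', {NAME => 'c', BLOCKCACHE => 'false'}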

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: lars hofhansl [larsh@apache.org]
Sent: Friday, January 31, 2014 9:25 PM
To: user@hbase.apache.org
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

If your data does not fit into cache and your request pattern is essentially random, then each GET will likely cause an entirely new HFile block to be read from disk (since that block was likely evicted due to other random GETs).

This is somewhat of a worst case for HBase. The default block size is 64k.
That is why the cache hit ratio is low and your disk IO is high. For each GET even reading just a single KV of a few hundred bytes, HBase needs to bring in 64k worth of data from disk.


With your load you can set the block size as low as 4k (or even lower).
That way HBase would still need to bring in a new block for each GET, but that block will only be 4k.
You can also try disabling the block cache, as it does not help in your scenario anyway.


Note that I mean the HFile block size, not the HDFS block (which is typically 64, 128, or 256 mb).


You can set this via the HBase shell as a column family parameter: BLOCKSIZE => '4096'
I'd start with 4k and then vary up and down and do some testing.

Truly random reads are very hard for any caching system.
Is your load really truly random, or is it just for testing?

-- Lars





Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by lars hofhansl <la...@apache.org>.
Pardon the bad spelling. Hit send too early. Also, in the second to last paragraph I meant using the HBase *Shell* to alter the BLOCKSIZE.

-- Lars



----- Original Message -----
From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Friday, January 31, 2014 9:25 PM
Subject: Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

If your data does not fit into cache and your request pattern is essentially random, then each GET will likely cause an entirely new HFile block to be read from disk (since that block was likely evicted due to other random GETs).

This is somewhat of a worst case for HBase. The default block size is 64k.
That is why the cache hit ratio is low and your disk IO is high. For each GET even reading just a single KV of a few hundred bytes, HBase needs to bring in 64k worth of data from disk.


With your load you can set the block size as low as 4k (or even lower).
That way HBase would still need to bring in a new block for each GET, but that block will only be 4k.
You can also try disabling the block cache, as it does not help in your scenario anyway.


Note that I mean the HFile block size, not the HDFS block (which is typically 64, 128, or 256 mb).


You can set this via the HBase shell as a column family parameter: BLOCKSIZE => '4096'
I'd start with 4k and then vary up and down and do some testing.

Truly random reads are very hard for any caching system.
Is your load really truly random, or is it just for testing?

-- Lars





Re: Slow Get Performance (or how many disk I/O does it take for one non-cached read?)

Posted by lars hofhansl <la...@apache.org>.
If your data does not fit into cache and your request pattern is essentially random, then each GET will likely cause an entirely new HFile block to be read from disk (since that block was likely evicted due to other random GETs).

This is somewhat of a worst case for HBase. The default block size is 64k.
That is why the cache hit ratio is low and your disk IO is high. For each GET even reading just a single KV of a few hundred bytes, HBase needs to bring in 64k worth of data from disk.


With your load you can set the block size as low as 4k (or even lower).
That way HBase would still need to bring in a new block for each GET, but that block will only be 4k.
You can also try disabling the block cache, as it does not help in your scenario anyway.


Note that I mean the HFile block size, not the HDFS block (which is typically 64, 128, or 256 mb).


You can set this via the HBase shell as a column family parameter: BLOCKSIZE => '4096'
I'd start with 4k and then vary up and down and do some testing.

Truly random reads are very hard for any caching system.
Is your load really truly random, or is it just for testing?

-- Lars

