Posted to dev@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2013/06/29 21:13:23 UTC

Poor HBase random read performance

Hi,

I was doing some tests on how good HBase random reads are. The setup
consists of a 1-node cluster with dfs replication set to 1. Short-circuit
local reads and HBase checksums are enabled. The data set is small enough
to be largely cached in the filesystem cache - 10G on a 60G machine.

The client sends out multi-get operations in batches of 10, and I try to
measure throughput.
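
A minimal sketch of such a batched-get loop against the 0.94 client API
looks roughly like this (the table name and key layout below are made up
for illustration, not the actual test code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetBenchmark {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "perf_test");           // hypothetical table name
    Random rnd = new Random();
    long end = System.currentTimeMillis() + 120 * 1000L;    // 120-second test window
    long reads = 0;
    while (System.currentTimeMillis() < end) {
      List<Get> batch = new ArrayList<Get>(10);
      for (int i = 0; i < 10; i++) {
        batch.add(new Get(Bytes.toBytes(rnd.nextLong())));  // random 8-byte row key
      }
      Result[] results = table.get(batch);                  // one multi-get RPC carrying 10 gets
      reads += results.length;
    }
    System.out.println("throughput = " + (reads / 120) + " reads/sec");
    table.close();
  }
}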

Test #1

All Data was cached in the block cache.

Test Time = 120 seconds
Num Read Ops = 12M

Throughput = 100K per second

Test #2

I disable the block cache, but now all the data is in the file system
cache. I verify this by making sure that IOPS on the disk drive are 0
during the test. I run the same test with batched ops.

Test Time = 120 seconds
Num Read Ops = 0.6M
Throughput = 5K per second

Test #3

I saw that all the threads were stuck in idLock.lockEntry(), so I now run
with both the lock and the block cache disabled.

Test Time = 120 seconds
Num Read Ops = 1.2M
Throughput = 10K per second

Test #4

I re-enable the block cache and this time hack HBase to cache only index
and Bloom blocks; data blocks still come from the file system cache.

Test Time = 120 seconds
Num Read Ops = 1.6M
Throughput = 13K per second

So, I wonder why there is such a massive drop in throughput. I know that
the HDFS code adds tremendous overhead, but this seems pretty high to me.
I use 0.94.7 and CDH 4.2.0.

Thanks
Varun

Re: Poor HBase random read performance

Posted by Varun Sharma <va...@pinterest.com>.
Yeah, that is a very interesting benchmark. I ran mine on hi1.4xlarge -
almost 4X more CPU than m1.xlarge.

In your tests, block cache performance essentially looks close to SCR +
OS page cache from a latency standpoint. I did not find throughput
numbers in your benchmark.

So I just lowered the block size further to 4K to see if there are more
gains, but I found that throughput remains roughly the same at 100K -
maybe slightly higher. But when I re-enable the block cache to cache all
the data blocks too, the throughput jumps to 250K+.

I hope to generate some more data for this table and hopefully I can test
with some real data on SSDs.

Thanks
Varun


On Mon, Jul 1, 2013 at 9:55 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> You might also be interested in this benchmark I ran 3 months ago:
>
> https://docs.google.com/spreadsheet/pub?key=0Ao87IrzZJSaydENaem5USWg4TlRKcHl0dEtTS2NBOUE&output=html
>
> J-D
>
> On Sat, Jun 29, 2013 at 12:13 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> > Hi,
> >
> > I was doing some tests on how good HBase random reads are. The setup is
> > consists of a 1 node cluster with dfs replication set to 1. Short circuit
> > local reads and HBase checksums are enabled. The data set is small enough
> > to be largely cached in the filesystem cache - 10G on a 60G machine.
> >
> > Client sends out multi-get operations in batches to 10 and I try to
> measure
> > throughput.
> >
> > Test #1
> >
> > All Data was cached in the block cache.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 12M
> >
> > Throughput = 100K per second
> >
> > Test #2
> >
> > I disable block cache. But now all the data is in the file system cache.
> I
> > verify this by making sure that IOPs on the disk drive are 0 during the
> > test. I run the same test with batched ops.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 0.6M
> > Throughput = 5K per second
> >
> > Test #3
> >
> > I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> > with the lock disabled and the block cache disabled.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 1.2M
> > Throughput = 10K per second
> >
> > Test #4
> >
> > I re enable block cache and this time hack hbase to only cache Index and
> > Bloom blocks but data blocks come from File System cache.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 1.6M
> > Throughput = 13K per second
> >
> > So, I wonder how come such a massive drop in throughput. I know that HDFS
> > code adds tremendous overhead but this seems pretty high to me. I use
> > 0.94.7 and cdh 4.2.0
> >
> > Thanks
> > Varun
>

Re: Poor HBase random read performance

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You might also be interested in this benchmark I ran 3 months ago:
https://docs.google.com/spreadsheet/pub?key=0Ao87IrzZJSaydENaem5USWg4TlRKcHl0dEtTS2NBOUE&output=html

J-D

On Sat, Jun 29, 2013 at 12:13 PM, Varun Sharma <va...@pinterest.com> wrote:
> Hi,
>
> I was doing some tests on how good HBase random reads are. The setup is
> consists of a 1 node cluster with dfs replication set to 1. Short circuit
> local reads and HBase checksums are enabled. The data set is small enough
> to be largely cached in the filesystem cache - 10G on a 60G machine.
>
> Client sends out multi-get operations in batches to 10 and I try to measure
> throughput.
>
> Test #1
>
> All Data was cached in the block cache.
>
> Test Time = 120 seconds
> Num Read Ops = 12M
>
> Throughput = 100K per second
>
> Test #2
>
> I disable block cache. But now all the data is in the file system cache. I
> verify this by making sure that IOPs on the disk drive are 0 during the
> test. I run the same test with batched ops.
>
> Test Time = 120 seconds
> Num Read Ops = 0.6M
> Throughput = 5K per second
>
> Test #3
>
> I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> with the lock disabled and the block cache disabled.
>
> Test Time = 120 seconds
> Num Read Ops = 1.2M
> Throughput = 10K per second
>
> Test #4
>
> I re enable block cache and this time hack hbase to only cache Index and
> Bloom blocks but data blocks come from File System cache.
>
> Test Time = 120 seconds
> Num Read Ops = 1.6M
> Throughput = 13K per second
>
> So, I wonder how come such a massive drop in throughput. I know that HDFS
> code adds tremendous overhead but this seems pretty high to me. I use
> 0.94.7 and cdh 4.2.0
>
> Thanks
> Varun

Re: Poor HBase random read performance

Posted by Varun Sharma <va...@pinterest.com>.
Another update. I reduced the block size from 32K (it seems I was running
with 32K initially, not 64K) to 8K and, bam, the throughput went from 4M
requests to 11M.
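
That matches the read amplification you would expect: with ~100-byte KVs,
every random get that misses the block cache pulls in a whole data block,
i.e. roughly 32K / 100 ≈ 300 KVs' worth of bytes per KV actually wanted,
versus ~80 KVs' worth with 8K blocks. For reference, a minimal sketch of
changing the block size per column family with the 0.94 admin API (table
and family names are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class UseSmallBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] tableName = Bytes.toBytes("perf_test");              // hypothetical table name
    HTableDescriptor desc = admin.getTableDescriptor(tableName);
    HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("d"));  // hypothetical family name
    cf.setBlocksize(8 * 1024);                                  // 8K data blocks instead of the 64K default
    admin.disableTable(tableName);                              // offline schema change
    admin.modifyColumn(tableName, cf);
    admin.enableTable(tableName);
    // Existing store files keep their old block size until they are rewritten,
    // so a major compaction is needed for the change to take full effect.
  }
}

The trade-off is a larger block index, which is one more reason to keep
index blocks in the block cache.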

One interesting thing to note, however, is that when I had 3 store files
per region, throughput on random reads was 1/3rd. This is understandable,
because you need to bring in 3X the amount of blocks and then merge.
However, when I look at the LevelDB benchmarks for non-compacted vs
compacted tables, I wonder why they are able to do 65K reads per second vs
80K reads per second when comparing compacted/non-compacted files. It
seems that in their benchmark, performance does not fall proportionally
with the # of store files (unless perhaps that benchmark includes bloom
filters, which I disabled).

Also, it seems the idLock issue was because of locking on index blocks,
which are always hot. Now idLock does not seem to be an issue when it is
only locking data blocks, and for truly random reads no data block is hot.


On Sat, Jun 29, 2013 at 3:39 PM, Varun Sharma <va...@pinterest.com> wrote:

> So, I just major compacted the table which initially had 3 store files and
> performance went 3X from 1.6M to 4M+.
>
> The tests I am running, have 8 byte keys with ~ 80-100 byte values. Right
> now i am working with 64K block size, I am going to make it 8K and see if
> that helps.
>
> The one point though is the IdLock mechanism - that seems to add a huge
> amount of overhead 2x - however in that test I was not caching index blocks
> in the block cache, which means a lot higher contention on those blocks. I
> believe it was used so that we dont load the same block twice from disk. I
> am wondering, when IOPs are surplus (ssds for example), if we should have
> an option to disable it though I probably should reevaluate it, with index
> blocks in block cache.
>
>
> On Sat, Jun 29, 2013 at 3:24 PM, lars hofhansl <la...@apache.org> wrote:
>
>> Should also say that random reads this way are somewhat of a worst case
>> scenario.
>>
>> If the working set is much larger than the block cache and the reads are
>> random, then each read will likely have to bring in an entirely new block
>> from the OS cache,
>> even when the KVs are much smaller than a block.
>>
>> So in order to read a (say) 1k KV HBase needs to bring 64k (default block
>> size) from the OS cache.
>> As long as the dataset fits into the block cache this difference in size
>> has no performance impact, but as soon as the dataset does not fit, we have
>> to bring much more data from the OS cache than we're actually interested in.
>>
>> Indeed in my test I found that HBase brings in about 60x the data size
>> from the OS cache (used PE with ~1k KVs). This can be improved with smaller
>> block sizes; and with a more efficient way to instantiate HFile blocks in
>> Java (which we need to work on).
>>
>>
>> -- Lars
>>
>> ________________________________
>> From: lars hofhansl <la...@apache.org>
>> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> Sent: Saturday, June 29, 2013 3:09 PM
>> Subject: Re: Poor HBase random read performance
>>
>>
>> I've seen the same bad performance behavior when I tested this on a real
>> cluster. (I think it was in 0.94.6)
>>
>>
>> Instead of en/disabling the blockcache, I tested sequential and random
>> reads on a data set that does not fit into the (aggregate) block cache.
>> Sequential reads were drastically faster than Random reads (7 vs 34
>> minutes), which can really only be explained with the fact that the next
>> get will with high probability hit an already cached block, whereas in the
>> random read case it likely will not.
>>
>> In the RandomRead case I estimate that each RegionServer brings in
>> between 100 and 200mb/s from the OS cache. Even at 200mb/s this would be
>> quite slow.I understand that performance is bad when index/bloom blocks are
>> not cached, but bringing in data blocks from the OS cache should be faster
>> than it is.
>>
>>
>> So this is something to debug.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>> From: Varun Sharma <va...@pinterest.com>
>> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
>> Sent: Saturday, June 29, 2013 12:13 PM
>> Subject: Poor HBase random read performance
>>
>>
>> Hi,
>>
>> I was doing some tests on how good HBase random reads are. The setup is
>> consists of a 1 node cluster with dfs replication set to 1. Short circuit
>> local reads and HBase checksums are enabled. The data set is small enough
>> to be largely cached in the filesystem cache - 10G on a 60G machine.
>>
>> Client sends out multi-get operations in batches to 10 and I try to
>> measure
>> throughput.
>>
>> Test #1
>>
>> All Data was cached in the block cache.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 12M
>>
>> Throughput = 100K per second
>>
>> Test #2
>>
>> I disable block cache. But now all the data is in the file system cache. I
>> verify this by making sure that IOPs on the disk drive are 0 during the
>> test. I run the same test with batched ops.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 0.6M
>> Throughput = 5K per second
>>
>> Test #3
>>
>> I saw all the threads are now stuck in idLock.lockEntry(). So I now run
>> with the lock disabled and the block cache disabled.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 1.2M
>> Throughput = 10K per second
>>
>> Test #4
>>
>> I re enable block cache and this time hack hbase to only cache Index and
>> Bloom blocks but data blocks come from File System cache.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 1.6M
>> Throughput = 13K per second
>>
>> So, I wonder how come such a massive drop in throughput. I know that HDFS
>> code adds tremendous overhead but this seems pretty high to me. I use
>> 0.94.7 and cdh 4.2.0
>>
>> Thanks
>> Varun
>>
>
>

Re: Poor HBase random read performance

Posted by Varun Sharma <va...@pinterest.com>.
So, I just major compacted the table, which initially had 3 store files,
and performance went up 3X, from 1.6M to 4M+.

The tests I am running have 8-byte keys with ~80-100 byte values. Right
now I am working with a 64K block size; I am going to make it 8K and see
if that helps.

The one point, though, is the IdLock mechanism - that seems to add a huge
amount of overhead (2x). However, in that test I was not caching index
blocks in the block cache, which means a lot higher contention on those
blocks. I believe it is used so that we don't load the same block twice
from disk. I am wondering, when IOPS are plentiful (SSDs, for example),
whether we should have an option to disable it, though I probably should
re-evaluate that with index blocks in the block cache.
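
For context, the pattern behind IdLock is essentially a per-block-id
latch: the first reader that misses on a block loads it, and every other
reader of that block waits instead of issuing a duplicate load. A
stripped-down illustration of that idea (not HBase's actual IdLock
implementation):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;

// Illustrative per-id lock: the first caller for a given block id proceeds and
// loads the block; later callers for the same id wait until the first one
// releases. This mirrors the contention pattern seen in idLock.lockEntry(),
// but is not HBase's actual IdLock class.
public class SimpleIdLock {
  private final ConcurrentMap<Long, CountDownLatch> locks =
      new ConcurrentHashMap<Long, CountDownLatch>();

  public void lockEntry(long id) throws InterruptedException {
    while (true) {
      CountDownLatch mine = new CountDownLatch(1);
      CountDownLatch existing = locks.putIfAbsent(id, mine);
      if (existing == null) {
        return;              // we won the race and now own the entry for this id
      }
      existing.await();      // someone else is loading this block; wait and retry
    }
  }

  public void releaseLockEntry(long id) {
    CountDownLatch mine = locks.remove(id);
    if (mine != null) {
      mine.countDown();      // wake every reader that piled up on this id
    }
  }
}

Deduplicating the loads is exactly what serializes all readers of a hot
block, such as an uncached index block, which is consistent with the 2x
overhead above.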


On Sat, Jun 29, 2013 at 3:24 PM, lars hofhansl <la...@apache.org> wrote:

> Should also say that random reads this way are somewhat of a worst case
> scenario.
>
> If the working set is much larger than the block cache and the reads are
> random, then each read will likely have to bring in an entirely new block
> from the OS cache,
> even when the KVs are much smaller than a block.
>
> So in order to read a (say) 1k KV HBase needs to bring 64k (default block
> size) from the OS cache.
> As long as the dataset fits into the block cache this difference in size
> has no performance impact, but as soon as the dataset does not fit, we have
> to bring much more data from the OS cache than we're actually interested in.
>
> Indeed in my test I found that HBase brings in about 60x the data size
> from the OS cache (used PE with ~1k KVs). This can be improved with smaller
> block sizes; and with a more efficient way to instantiate HFile blocks in
> Java (which we need to work on).
>
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 3:09 PM
> Subject: Re: Poor HBase random read performance
>
>
> I've seen the same bad performance behavior when I tested this on a real
> cluster. (I think it was in 0.94.6)
>
>
> Instead of en/disabling the blockcache, I tested sequential and random
> reads on a data set that does not fit into the (aggregate) block cache.
> Sequential reads were drastically faster than Random reads (7 vs 34
> minutes), which can really only be explained with the fact that the next
> get will with high probability hit an already cached block, whereas in the
> random read case it likely will not.
>
> In the RandomRead case I estimate that each RegionServer brings in between
> 100 and 200mb/s from the OS cache. Even at 200mb/s this would be quite
> slow.I understand that performance is bad when index/bloom blocks are not
> cached, but bringing in data blocks from the OS cache should be faster than
> it is.
>
>
> So this is something to debug.
>
> -- Lars
>
>
>
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 12:13 PM
> Subject: Poor HBase random read performance
>
>
> Hi,
>
> I was doing some tests on how good HBase random reads are. The setup is
> consists of a 1 node cluster with dfs replication set to 1. Short circuit
> local reads and HBase checksums are enabled. The data set is small enough
> to be largely cached in the filesystem cache - 10G on a 60G machine.
>
> Client sends out multi-get operations in batches to 10 and I try to measure
> throughput.
>
> Test #1
>
> All Data was cached in the block cache.
>
> Test Time = 120 seconds
> Num Read Ops = 12M
>
> Throughput = 100K per second
>
> Test #2
>
> I disable block cache. But now all the data is in the file system cache. I
> verify this by making sure that IOPs on the disk drive are 0 during the
> test. I run the same test with batched ops.
>
> Test Time = 120 seconds
> Num Read Ops = 0.6M
> Throughput = 5K per second
>
> Test #3
>
> I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> with the lock disabled and the block cache disabled.
>
> Test Time = 120 seconds
> Num Read Ops = 1.2M
> Throughput = 10K per second
>
> Test #4
>
> I re enable block cache and this time hack hbase to only cache Index and
> Bloom blocks but data blocks come from File System cache.
>
> Test Time = 120 seconds
> Num Read Ops = 1.6M
> Throughput = 13K per second
>
> So, I wonder how come such a massive drop in throughput. I know that HDFS
> code adds tremendous overhead but this seems pretty high to me. I use
> 0.94.7 and cdh 4.2.0
>
> Thanks
> Varun
>

RE: Poor HBase random read performance

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
My bad, Bloom filters won't help to locate the most recent version of a column.

So, back to my favorite LevelDB ... To locate the most recent version of a column, LevelDB must check:

1. The memstore
2. All sst-files in Level-0 (their Bloom filters are in memory)
3. One sst-file per every other level, beginning with L1 (this does not require file I/O since the Bloom filters are kept in memory): L1, then L2, then L3, etc. The first match found is the most recent version.

In most cases, all these lookups are done entirely in memory without even a single disk I/O.
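
In pseudo-Java, that lookup order is roughly the following (illustrative
names, not LevelDB's real API):

import java.util.List;
import java.util.Map;

// Sketch of the lookup order above. The first match found is the most recent
// version, and because the Bloom filters live in memory, most probes never
// touch disk.
interface SstFile {
  boolean bloomMightContain(String key);  // in-memory Bloom filter check
  byte[] get(String key);                 // null if the key is not in this file
}

class LevelDbStyleRead {
  Map<String, byte[]> memstore;           // 1. the in-memory write buffer
  List<SstFile> level0NewestFirst;        // 2. L0 files may overlap, so all are checked
  List<SstFile> oneFilePerLevel;          // 3. the single candidate file for L1, L2, L3, ...

  byte[] get(String key) {
    byte[] v = memstore.get(key);
    if (v != null) return v;
    for (SstFile f : level0NewestFirst) {
      if (f.bloomMightContain(key) && (v = f.get(key)) != null) return v;
    }
    for (SstFile f : oneFilePerLevel) {
      if (f.bloomMightContain(key) && (v = f.get(key)) != null) return v;
    }
    return null;
  }
}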


Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Vladimir Rodionov
Sent: Monday, July 01, 2013 4:57 PM
To: dev@hbase.apache.org; lars hofhansl
Subject: RE: Poor HBase random read performance

Varun,

LevelDB relies on Bloom filters to check only the relevant sst-files. I think this is what Bloom filters are for in HBase as well.
Am I wrong?

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Monday, July 01, 2013 4:10 PM
To: dev@hbase.apache.org; lars hofhansl
Subject: Re: Poor HBase random read performance

Going back to leveldb vs hbase, I am not sure if we can come with a clean
way to identify HFiles containing more recent data in the wake of
compactions

I though wonder if this works with minor compactions, lets say you compact
a really old file with a new file. Now since this file's most recent
timestamp is very recent because of the new file, you look into this file,
but then retrieve something from the "old" portion of this file. So you end
with older data.

I guess one way would be just order the files by time ranges. Non
intersecting time range files can be ordered in reverse time order.
Intersecting stuff can be seeked together.

     File1
|-----------------|
                          File2
                     |---------------|
                                       File3
                             |-----------------------------|
                                                                     File4

 |--------------------|

So in this case, we seek

[File1], [File2, File3], [File4]

I think for random single key value looks (row, col)->key - this could lead
to good savings for time ordered clients (which are quite common). Unless
File1 and File4 get compacted, in which case, we always need to seek into
both.



On Mon, Jul 1, 2013 at 12:10 PM, lars hofhansl <la...@apache.org> wrote:

> Sorry. Hit enter too early.
>
> Some discussion here:
> http://apache-hbase.679495.n3.nabble.com/keyvalue-cache-td3882628.html
> but no actionable outcome.
>
> -- Lars
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Monday, July 1, 2013 12:05 PM
> Subject: Re: Poor HBase random read performance
>
>
> This came up a few times before.
>
>
>
> ________________________________
> From: Vladimir Rodionov <vr...@carrieriq.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <
> larsh@apache.org>
> Sent: Monday, July 1, 2013 11:08 AM
> Subject: RE: Poor HBase random read performance
>
>
> I would like to remind that in original BigTable's design  there is scan
> cache to take care of random reads and this
> important feature is still missing in HBase.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Saturday, June 29, 2013 3:24 PM
> To: dev@hbase.apache.org
> Subject: Re: Poor HBase random read performance
>
> Should also say that random reads this way are somewhat of a worst case
> scenario.
>
> If the working set is much larger than the block cache and the reads are
> random, then each read will likely have to bring in an entirely new block
> from the OS cache,
> even when the KVs are much smaller than a block.
>
> So in order to read a (say) 1k KV HBase needs to bring 64k (default block
> size) from the OS cache.
> As long as the dataset fits into the block cache this difference in size
> has no performance impact, but as soon as the dataset does not fit, we have
> to bring much more data from the OS cache than we're actually interested in.
>
> Indeed in my test I found that HBase brings in about 60x the data size
> from the OS cache (used PE with ~1k KVs). This can be improved with smaller
> block sizes; and with a more efficient way to instantiate HFile blocks in
> Java (which we need to work on).
>
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 3:09 PM
> Subject: Re: Poor HBase random read performance
>
>
> I've seen the same bad performance behavior when I tested this on a real
> cluster. (I think it was in 0.94.6)
>
>
> Instead of en/disabling the blockcache, I tested sequential and random
> reads on a data set that does not fit into the (aggregate) block cache.
> Sequential reads were drastically faster than Random reads (7 vs 34
> minutes), which can really only be explained with the fact that the next
> get will with high probability hit an already cached block, whereas in the
> random read case it likely will not.
>
> In the RandomRead case I estimate that each RegionServer brings in between
> 100 and 200mb/s from the OS cache. Even at 200mb/s this would be quite
> slow.I understand that performance is bad when index/bloom blocks are not
> cached, but bringing in data blocks from the OS cache should be faster than
> it is.
>
>
> So this is something to debug.
>
> -- Lars
>
>
>
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 12:13 PM
> Subject: Poor HBase random read performance
>
>
> Hi,
>
> I was doing some tests on how good HBase random reads are. The setup is
> consists of a 1 node cluster with dfs replication set to 1. Short circuit
> local reads and HBase checksums are enabled. The data set is small enough
> to be largely cached in the filesystem cache - 10G on a 60G machine.
>
> Client sends out multi-get operations in batches to 10 and I try to measure
> throughput.
>
> Test #1
>
> All Data was cached in the block cache.
>
> Test Time = 120 seconds
> Num Read Ops = 12M
>
> Throughput = 100K per second
>
> Test #2
>
> I disable block cache. But now all the data is in the file system cache. I
> verify this by making sure that IOPs on the disk drive are 0 during the
> test. I run the same test with batched ops.
>
> Test Time = 120 seconds
> Num Read Ops = 0.6M
> Throughput = 5K per second
>
> Test #3
>
> I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> with the lock disabled and the block cache disabled.
>
> Test Time = 120 seconds
> Num Read Ops = 1.2M
> Throughput = 10K per second
>
> Test #4
>
> I re enable block cache and this time hack hbase to only cache Index and
> Bloom blocks but data blocks come from File System cache.
>
> Test Time = 120 seconds
> Num Read Ops = 1.6M
> Throughput = 13K per second
>
> So, I wonder how come such a massive drop in throughput. I know that HDFS
> code adds tremendous overhead but this seems pretty high to me. I use
> 0.94.7 and cdh 4.2.0
>
> Thanks
> Varun
>


RE: Poor HBase random read performance

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
Varun,

LevelDB relies on Bloom filters to check only the relevant sst-files. I think this is what Bloom filters are for in HBase as well.
Am I wrong?

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Varun Sharma [varun@pinterest.com]
Sent: Monday, July 01, 2013 4:10 PM
To: dev@hbase.apache.org; lars hofhansl
Subject: Re: Poor HBase random read performance

Going back to leveldb vs hbase, I am not sure if we can come with a clean
way to identify HFiles containing more recent data in the wake of
compactions

I though wonder if this works with minor compactions, lets say you compact
a really old file with a new file. Now since this file's most recent
timestamp is very recent because of the new file, you look into this file,
but then retrieve something from the "old" portion of this file. So you end
with older data.

I guess one way would be just order the files by time ranges. Non
intersecting time range files can be ordered in reverse time order.
Intersecting stuff can be seeked together.

     File1
|-----------------|
                          File2
                     |---------------|
                                       File3
                             |-----------------------------|
                                                                     File4

 |--------------------|

So in this case, we seek

[File1], [File2, File3], [File4]

I think for random single key value looks (row, col)->key - this could lead
to good savings for time ordered clients (which are quite common). Unless
File1 and File4 get compacted, in which case, we always need to seek into
both.



On Mon, Jul 1, 2013 at 12:10 PM, lars hofhansl <la...@apache.org> wrote:

> Sorry. Hit enter too early.
>
> Some discussion here:
> http://apache-hbase.679495.n3.nabble.com/keyvalue-cache-td3882628.html
> but no actionable outcome.
>
> -- Lars
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Monday, July 1, 2013 12:05 PM
> Subject: Re: Poor HBase random read performance
>
>
> This came up a few times before.
>
>
>
> ________________________________
> From: Vladimir Rodionov <vr...@carrieriq.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <
> larsh@apache.org>
> Sent: Monday, July 1, 2013 11:08 AM
> Subject: RE: Poor HBase random read performance
>
>
> I would like to remind that in original BigTable's design  there is scan
> cache to take care of random reads and this
> important feature is still missing in HBase.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Saturday, June 29, 2013 3:24 PM
> To: dev@hbase.apache.org
> Subject: Re: Poor HBase random read performance
>
> Should also say that random reads this way are somewhat of a worst case
> scenario.
>
> If the working set is much larger than the block cache and the reads are
> random, then each read will likely have to bring in an entirely new block
> from the OS cache,
> even when the KVs are much smaller than a block.
>
> So in order to read a (say) 1k KV HBase needs to bring 64k (default block
> size) from the OS cache.
> As long as the dataset fits into the block cache this difference in size
> has no performance impact, but as soon as the dataset does not fit, we have
> to bring much more data from the OS cache than we're actually interested in.
>
> Indeed in my test I found that HBase brings in about 60x the data size
> from the OS cache (used PE with ~1k KVs). This can be improved with smaller
> block sizes; and with a more efficient way to instantiate HFile blocks in
> Java (which we need to work on).
>
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 3:09 PM
> Subject: Re: Poor HBase random read performance
>
>
> I've seen the same bad performance behavior when I tested this on a real
> cluster. (I think it was in 0.94.6)
>
>
> Instead of en/disabling the blockcache, I tested sequential and random
> reads on a data set that does not fit into the (aggregate) block cache.
> Sequential reads were drastically faster than Random reads (7 vs 34
> minutes), which can really only be explained with the fact that the next
> get will with high probability hit an already cached block, whereas in the
> random read case it likely will not.
>
> In the RandomRead case I estimate that each RegionServer brings in between
> 100 and 200mb/s from the OS cache. Even at 200mb/s this would be quite
> slow.I understand that performance is bad when index/bloom blocks are not
> cached, but bringing in data blocks from the OS cache should be faster than
> it is.
>
>
> So this is something to debug.
>
> -- Lars
>
>
>
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 12:13 PM
> Subject: Poor HBase random read performance
>
>
> Hi,
>
> I was doing some tests on how good HBase random reads are. The setup is
> consists of a 1 node cluster with dfs replication set to 1. Short circuit
> local reads and HBase checksums are enabled. The data set is small enough
> to be largely cached in the filesystem cache - 10G on a 60G machine.
>
> Client sends out multi-get operations in batches to 10 and I try to measure
> throughput.
>
> Test #1
>
> All Data was cached in the block cache.
>
> Test Time = 120 seconds
> Num Read Ops = 12M
>
> Throughput = 100K per second
>
> Test #2
>
> I disable block cache. But now all the data is in the file system cache. I
> verify this by making sure that IOPs on the disk drive are 0 during the
> test. I run the same test with batched ops.
>
> Test Time = 120 seconds
> Num Read Ops = 0.6M
> Throughput = 5K per second
>
> Test #3
>
> I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> with the lock disabled and the block cache disabled.
>
> Test Time = 120 seconds
> Num Read Ops = 1.2M
> Throughput = 10K per second
>
> Test #4
>
> I re enable block cache and this time hack hbase to only cache Index and
> Bloom blocks but data blocks come from File System cache.
>
> Test Time = 120 seconds
> Num Read Ops = 1.6M
> Throughput = 13K per second
>
> So, I wonder how come such a massive drop in throughput. I know that HDFS
> code adds tremendous overhead but this seems pretty high to me. I use
> 0.94.7 and cdh 4.2.0
>
> Thanks
> Varun
>


Re: Poor HBase random read performance

Posted by Ted Yu <yu...@gmail.com>.
bq.  lets say you compact a really old file with a new file.

I think stripe compaction is supposed to handle the above scenario. Take a
look at:
https://issues.apache.org/jira/browse/HBASE-7667

Please also refer to Sergey's talk @ HBaseCon.

Cheers

On Mon, Jul 1, 2013 at 4:10 PM, Varun Sharma <va...@pinterest.com> wrote:

> Going back to leveldb vs hbase, I am not sure if we can come with a clean
> way to identify HFiles containing more recent data in the wake of
> compactions
>
> I though wonder if this works with minor compactions, lets say you compact
> a really old file with a new file. Now since this file's most recent
> timestamp is very recent because of the new file, you look into this file,
> but then retrieve something from the "old" portion of this file. So you end
> with older data.
>
> I guess one way would be just order the files by time ranges. Non
> intersecting time range files can be ordered in reverse time order.
> Intersecting stuff can be seeked together.
>
>      File1
> |-----------------|
>                           File2
>                      |---------------|
>                                        File3
>                              |-----------------------------|
>                                                                      File4
>
>  |--------------------|
>
> So in this case, we seek
>
> [File1], [File2, File3], [File4]
>
> I think for random single key value looks (row, col)->key - this could lead
> to good savings for time ordered clients (which are quite common). Unless
> File1 and File4 get compacted, in which case, we always need to seek into
> both.
>
>
>
> On Mon, Jul 1, 2013 at 12:10 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Sorry. Hit enter too early.
> >
> > Some discussion here:
> > http://apache-hbase.679495.n3.nabble.com/keyvalue-cache-td3882628.html
> > but no actionable outcome.
> >
> > -- Lars
> > ________________________________
> > From: lars hofhansl <la...@apache.org>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Sent: Monday, July 1, 2013 12:05 PM
> > Subject: Re: Poor HBase random read performance
> >
> >
> > This came up a few times before.
> >
> >
> >
> > ________________________________
> > From: Vladimir Rodionov <vr...@carrieriq.com>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <
> > larsh@apache.org>
> > Sent: Monday, July 1, 2013 11:08 AM
> > Subject: RE: Poor HBase random read performance
> >
> >
> > I would like to remind that in original BigTable's design  there is scan
> > cache to take care of random reads and this
> > important feature is still missing in HBase.
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: vrodionov@carrieriq.com
> >
> > ________________________________________
> > From: lars hofhansl [larsh@apache.org]
> > Sent: Saturday, June 29, 2013 3:24 PM
> > To: dev@hbase.apache.org
> > Subject: Re: Poor HBase random read performance
> >
> > Should also say that random reads this way are somewhat of a worst case
> > scenario.
> >
> > If the working set is much larger than the block cache and the reads are
> > random, then each read will likely have to bring in an entirely new block
> > from the OS cache,
> > even when the KVs are much smaller than a block.
> >
> > So in order to read a (say) 1k KV HBase needs to bring 64k (default block
> > size) from the OS cache.
> > As long as the dataset fits into the block cache this difference in size
> > has no performance impact, but as soon as the dataset does not fit, we
> have
> > to bring much more data from the OS cache than we're actually interested
> in.
> >
> > Indeed in my test I found that HBase brings in about 60x the data size
> > from the OS cache (used PE with ~1k KVs). This can be improved with
> smaller
> > block sizes; and with a more efficient way to instantiate HFile blocks in
> > Java (which we need to work on).
> >
> >
> > -- Lars
> >
> > ________________________________
> > From: lars hofhansl <la...@apache.org>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Sent: Saturday, June 29, 2013 3:09 PM
> > Subject: Re: Poor HBase random read performance
> >
> >
> > I've seen the same bad performance behavior when I tested this on a real
> > cluster. (I think it was in 0.94.6)
> >
> >
> > Instead of en/disabling the blockcache, I tested sequential and random
> > reads on a data set that does not fit into the (aggregate) block cache.
> > Sequential reads were drastically faster than Random reads (7 vs 34
> > minutes), which can really only be explained with the fact that the next
> > get will with high probability hit an already cached block, whereas in
> the
> > random read case it likely will not.
> >
> > In the RandomRead case I estimate that each RegionServer brings in
> between
> > 100 and 200mb/s from the OS cache. Even at 200mb/s this would be quite
> > slow.I understand that performance is bad when index/bloom blocks are not
> > cached, but bringing in data blocks from the OS cache should be faster
> than
> > it is.
> >
> >
> > So this is something to debug.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> > From: Varun Sharma <va...@pinterest.com>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Sent: Saturday, June 29, 2013 12:13 PM
> > Subject: Poor HBase random read performance
> >
> >
> > Hi,
> >
> > I was doing some tests on how good HBase random reads are. The setup is
> > consists of a 1 node cluster with dfs replication set to 1. Short circuit
> > local reads and HBase checksums are enabled. The data set is small enough
> > to be largely cached in the filesystem cache - 10G on a 60G machine.
> >
> > Client sends out multi-get operations in batches to 10 and I try to
> measure
> > throughput.
> >
> > Test #1
> >
> > All Data was cached in the block cache.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 12M
> >
> > Throughput = 100K per second
> >
> > Test #2
> >
> > I disable block cache. But now all the data is in the file system cache.
> I
> > verify this by making sure that IOPs on the disk drive are 0 during the
> > test. I run the same test with batched ops.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 0.6M
> > Throughput = 5K per second
> >
> > Test #3
> >
> > I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> > with the lock disabled and the block cache disabled.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 1.2M
> > Throughput = 10K per second
> >
> > Test #4
> >
> > I re enable block cache and this time hack hbase to only cache Index and
> > Bloom blocks but data blocks come from File System cache.
> >
> > Test Time = 120 seconds
> > Num Read Ops = 1.6M
> > Throughput = 13K per second
> >
> > So, I wonder how come such a massive drop in throughput. I know that HDFS
> > code adds tremendous overhead but this seems pretty high to me. I use
> > 0.94.7 and cdh 4.2.0
> >
> > Thanks
> > Varun
> >
>

Re: Poor HBase random read performance

Posted by Varun Sharma <va...@pinterest.com>.
Going back to LevelDB vs HBase, I am not sure we can come up with a clean
way to identify the HFiles containing more recent data in the wake of
compactions.

I do wonder, though, whether this works with minor compactions. Let's say
you compact a really old file with a new file. Now, since this file's most
recent timestamp is very recent because of the new file, you look into
this file, but then retrieve something from the "old" portion of it. So
you end up with older data.

I guess one way would be to just order the files by time ranges.
Non-intersecting time-range files can be ordered in reverse time order.
Intersecting ones have to be seeked together.

     File1
|-----------------|
                          File2
                     |---------------|
                                       File3
                             |-----------------------------|
                                                                     File4

 |--------------------|

So in this case, we seek

[File1], [File2, File3], [File4]

I think for random single key-value lookups ((row, col) -> key), this
could lead to good savings for time-ordered clients (which are quite
common) - unless File1 and File4 get compacted, in which case we always
need to seek into both.
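
As a rough sketch of that grouping (illustrative Java, not an existing
HBase API): sort the store files by the newest timestamp they contain,
then merge files whose time ranges intersect into a single group; groups
are probed newest-first, and files inside a group still have to be seeked
together.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative grouping of store files by time range.
class TimeRangeGrouping {

  static class FileRange {
    final String name;
    final long minTs, maxTs;
    FileRange(String name, long minTs, long maxTs) {
      this.name = name; this.minTs = minTs; this.maxTs = maxTs;
    }
  }

  static List<List<FileRange>> group(List<FileRange> files) {
    List<FileRange> sorted = new ArrayList<FileRange>(files);
    Collections.sort(sorted, new Comparator<FileRange>() {
      public int compare(FileRange a, FileRange b) {        // newest data first
        return b.maxTs < a.maxTs ? -1 : (b.maxTs > a.maxTs ? 1 : 0);
      }
    });
    List<List<FileRange>> groups = new ArrayList<List<FileRange>>();
    for (FileRange f : sorted) {
      List<FileRange> current = groups.isEmpty() ? null : groups.get(groups.size() - 1);
      // f intersects the current group iff its newest ts reaches back into the group's range.
      if (current != null && f.maxTs >= minTsOf(current)) {
        current.add(f);                                     // overlapping: must be seeked together
      } else {
        List<FileRange> g = new ArrayList<FileRange>();
        g.add(f);
        groups.add(g);                                      // non-overlapping: a new, older group
      }
    }
    return groups;
  }

  private static long minTsOf(List<FileRange> group) {
    long min = Long.MAX_VALUE;
    for (FileRange f : group) min = Math.min(min, f.minTs);
    return min;
  }
}

On non-overlapping files this degenerates to plain reverse-time order;
overlapping ones, like File2 and File3 above, end up in the same group.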



On Mon, Jul 1, 2013 at 12:10 PM, lars hofhansl <la...@apache.org> wrote:

> Sorry. Hit enter too early.
>
> Some discussion here:
> http://apache-hbase.679495.n3.nabble.com/keyvalue-cache-td3882628.html
> but no actionable outcome.
>
> -- Lars
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Monday, July 1, 2013 12:05 PM
> Subject: Re: Poor HBase random read performance
>
>
> This came up a few times before.
>
>
>
> ________________________________
> From: Vladimir Rodionov <vr...@carrieriq.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <
> larsh@apache.org>
> Sent: Monday, July 1, 2013 11:08 AM
> Subject: RE: Poor HBase random read performance
>
>
> I would like to remind that in original BigTable's design  there is scan
> cache to take care of random reads and this
> important feature is still missing in HBase.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [larsh@apache.org]
> Sent: Saturday, June 29, 2013 3:24 PM
> To: dev@hbase.apache.org
> Subject: Re: Poor HBase random read performance
>
> Should also say that random reads this way are somewhat of a worst case
> scenario.
>
> If the working set is much larger than the block cache and the reads are
> random, then each read will likely have to bring in an entirely new block
> from the OS cache,
> even when the KVs are much smaller than a block.
>
> So in order to read a (say) 1k KV HBase needs to bring 64k (default block
> size) from the OS cache.
> As long as the dataset fits into the block cache this difference in size
> has no performance impact, but as soon as the dataset does not fit, we have
> to bring much more data from the OS cache than we're actually interested in.
>
> Indeed in my test I found that HBase brings in about 60x the data size
> from the OS cache (used PE with ~1k KVs). This can be improved with smaller
> block sizes; and with a more efficient way to instantiate HFile blocks in
> Java (which we need to work on).
>
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 3:09 PM
> Subject: Re: Poor HBase random read performance
>
>
> I've seen the same bad performance behavior when I tested this on a real
> cluster. (I think it was in 0.94.6)
>
>
> Instead of en/disabling the blockcache, I tested sequential and random
> reads on a data set that does not fit into the (aggregate) block cache.
> Sequential reads were drastically faster than Random reads (7 vs 34
> minutes), which can really only be explained with the fact that the next
> get will with high probability hit an already cached block, whereas in the
> random read case it likely will not.
>
> In the RandomRead case I estimate that each RegionServer brings in between
> 100 and 200mb/s from the OS cache. Even at 200mb/s this would be quite
> slow.I understand that performance is bad when index/bloom blocks are not
> cached, but bringing in data blocks from the OS cache should be faster than
> it is.
>
>
> So this is something to debug.
>
> -- Lars
>
>
>
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Sent: Saturday, June 29, 2013 12:13 PM
> Subject: Poor HBase random read performance
>
>
> Hi,
>
> I was doing some tests on how good HBase random reads are. The setup is
> consists of a 1 node cluster with dfs replication set to 1. Short circuit
> local reads and HBase checksums are enabled. The data set is small enough
> to be largely cached in the filesystem cache - 10G on a 60G machine.
>
> Client sends out multi-get operations in batches to 10 and I try to measure
> throughput.
>
> Test #1
>
> All Data was cached in the block cache.
>
> Test Time = 120 seconds
> Num Read Ops = 12M
>
> Throughput = 100K per second
>
> Test #2
>
> I disable block cache. But now all the data is in the file system cache. I
> verify this by making sure that IOPs on the disk drive are 0 during the
> test. I run the same test with batched ops.
>
> Test Time = 120 seconds
> Num Read Ops = 0.6M
> Throughput = 5K per second
>
> Test #3
>
> I saw all the threads are now stuck in idLock.lockEntry(). So I now run
> with the lock disabled and the block cache disabled.
>
> Test Time = 120 seconds
> Num Read Ops = 1.2M
> Throughput = 10K per second
>
> Test #4
>
> I re enable block cache and this time hack hbase to only cache Index and
> Bloom blocks but data blocks come from File System cache.
>
> Test Time = 120 seconds
> Num Read Ops = 1.6M
> Throughput = 13K per second
>
> So, I wonder how come such a massive drop in throughput. I know that HDFS
> code adds tremendous overhead but this seems pretty high to me. I use
> 0.94.7 and cdh 4.2.0
>
> Thanks
> Varun
>

Re: Poor HBase random read performance

Posted by lars hofhansl <la...@apache.org>.
Sorry. Hit enter too early.

Some discussion here: http://apache-hbase.679495.n3.nabble.com/keyvalue-cache-td3882628.html
but no actionable outcome.

-- Lars
________________________________
From: lars hofhansl <la...@apache.org>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Sent: Monday, July 1, 2013 12:05 PM
Subject: Re: Poor HBase random read performance


This came up a few times before.



________________________________
From: Vladimir Rodionov <vr...@carrieriq.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Monday, July 1, 2013 11:08 AM
Subject: RE: Poor HBase random read performance


I would like to remind that in original BigTable's design  there is scan cache to take care of random reads and this
important feature is still missing in HBase.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: lars hofhansl [larsh@apache.org]
Sent: Saturday, June 29, 2013 3:24 PM
To: dev@hbase.apache.org
Subject: Re: Poor HBase random read performance

Should also say that random reads this way are somewhat of a worst case scenario.

If the working set is much larger than the block cache and the reads are random, then each read will likely have to bring in an entirely new block from the OS cache,
even when the KVs are much smaller than a block.

So in order to read a (say) 1k KV HBase needs to bring 64k (default block size) from the OS cache.
As long as the dataset fits into the block cache this difference in size has no performance impact, but as soon as the dataset does not fit, we have to bring much more data from the OS cache than we're actually interested in.

Indeed in my test I found that HBase brings in about 60x the data size from the OS cache (used PE with ~1k KVs). This can be improved with smaller block sizes; and with a more efficient way to instantiate HFile blocks in Java (which we need to work on).


-- Lars

________________________________
From: lars hofhansl <la...@apache.org>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Saturday, June 29, 2013 3:09 PM
Subject: Re: Poor HBase random read performance


I've seen the same bad performance behavior when I tested this on a real cluster. (I think it was in 0.94.6)


Instead of en/disabling the blockcache, I tested sequential and random reads on a data set that does not fit into the (aggregate) block cache.
Sequential reads were drastically faster than random reads (7 vs 34 minutes), which can really only be explained by the fact that the next get will with high probability hit an already cached block, whereas in the random read case it likely will not.

In the RandomRead case I estimate that each RegionServer brings in between 100 and 200 MB/s from the OS cache. Even at 200 MB/s this would be quite slow. I understand that performance is bad when index/bloom blocks are not cached, but bringing in data blocks from the OS cache should be faster than it is.


So this is something to debug.

-- Lars



________________________________
From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Saturday, June 29, 2013 12:13 PM
Subject: Poor HBase random read performance


Hi,

I was doing some tests on how good HBase random reads are. The setup is
consists of a 1 node cluster with dfs replication set to 1. Short circuit
local reads and HBase checksums are enabled. The data set is small enough
to be largely cached in the filesystem cache - 10G on a 60G machine.

Client sends out multi-get operations in batches to 10 and I try to measure
throughput.

Test #1

All Data was cached in the block cache.

Test Time = 120 seconds
Num Read Ops = 12M

Throughput = 100K per second

Test #2

I disable block cache. But now all the data is in the file system cache. I
verify this by making sure that IOPs on the disk drive are 0 during the
test. I run the same test with batched ops.

Test Time = 120 seconds
Num Read Ops = 0.6M
Throughput = 5K per second

Test #3

I saw all the threads are now stuck in idLock.lockEntry(). So I now run
with the lock disabled and the block cache disabled.

Test Time = 120 seconds
Num Read Ops = 1.2M
Throughput = 10K per second

Test #4

I re enable block cache and this time hack hbase to only cache Index and
Bloom blocks but data blocks come from File System cache.

Test Time = 120 seconds
Num Read Ops = 1.6M
Throughput = 13K per second

So, I wonder how come such a massive drop in throughput. I know that HDFS
code adds tremendous overhead but this seems pretty high to me. I use
0.94.7 and cdh 4.2.0

Thanks
Varun


Re: Poor HBase random read performance

Posted by lars hofhansl <la...@apache.org>.
This came up a few times before.



________________________________
 From: Vladimir Rodionov <vr...@carrieriq.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Monday, July 1, 2013 11:08 AM
Subject: RE: Poor HBase random read performance
 

I would like to remind that in original BigTable's design  there is scan cache to take care of random reads and this
important feature is still missing in HBase.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: lars hofhansl [larsh@apache.org]
Sent: Saturday, June 29, 2013 3:24 PM
To: dev@hbase.apache.org
Subject: Re: Poor HBase random read performance

Should also say that random reads this way are somewhat of a worst case scenario.

If the working set is much larger than the block cache and the reads are random, then each read will likely have to bring in an entirely new block from the OS cache,
even when the KVs are much smaller than a block.

So in order to read a (say) 1k KV HBase needs to bring 64k (default block size) from the OS cache.
As long as the dataset fits into the block cache this difference in size has no performance impact, but as soon as the dataset does not fit, we have to bring much more data from the OS cache than we're actually interested in.

Indeed in my test I found that HBase brings in about 60x the data size from the OS cache (used PE with ~1k KVs). This can be improved with smaller block sizes; and with a more efficient way to instantiate HFile blocks in Java (which we need to work on).


-- Lars

________________________________
From: lars hofhansl <la...@apache.org>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Saturday, June 29, 2013 3:09 PM
Subject: Re: Poor HBase random read performance


I've seen the same bad performance behavior when I tested this on a real cluster. (I think it was in 0.94.6)


Instead of en/disabling the blockcache, I tested sequential and random reads on a data set that does not fit into the (aggregate) block cache.
Sequential reads were drastically faster than random reads (7 vs 34 minutes), which can really only be explained by the fact that the next get will with high probability hit an already cached block, whereas in the random read case it likely will not.

In the RandomRead case I estimate that each RegionServer brings in between 100 and 200 MB/s from the OS cache. Even at 200 MB/s this would be quite slow. I understand that performance is bad when index/bloom blocks are not cached, but bringing in data blocks from the OS cache should be faster than it is.


So this is something to debug.

-- Lars



________________________________
From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Saturday, June 29, 2013 12:13 PM
Subject: Poor HBase random read performance


Hi,

I was doing some tests on how good HBase random reads are. The setup
consists of a 1-node cluster with dfs replication set to 1. Short-circuit
local reads and HBase checksums are enabled. The data set is small enough
to be largely cached in the filesystem cache - 10G on a 60G machine.

The client sends out multi-get operations in batches of 10, and I try to measure
throughput.
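
As a rough sketch of what such a batched multi-get looks like with the 0.94-era client API (the table name, key space and exact batch construction below are placeholders, not the ones used in this test):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedGetSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table");      // placeholder table name
        Random rand = new Random();
        try {
          // One batch of 10 random Gets, sent as a single multi-get call.
          List<Get> batch = new ArrayList<Get>(10);
          for (int i = 0; i < 10; i++) {
            Get get = new Get(Bytes.toBytes("row-" + rand.nextInt(10000000)));
            // Test #2 style runs can skip the block cache per request:
            // get.setCacheBlocks(false);
            batch.add(get);
          }
          Result[] results = table.get(batch);             // batched server-side lookup
          System.out.println("fetched " + results.length + " rows");
        } finally {
          table.close();
        }
      }
    }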

Test #1

All Data was cached in the block cache.

Test Time = 120 seconds
Num Read Ops = 12M

Throughput = 100K per second

Test #2

I disable block cache. But now all the data is in the file system cache. I
verify this by making sure that IOPs on the disk drive are 0 during the
test. I run the same test with batched ops.

Test Time = 120 seconds
Num Read Ops = 0.6M
Throughput = 5K per second

Test #3

I saw all the threads are now stuck in idLock.lockEntry(). So I now run
with the lock disabled and the block cache disabled.

Test Time = 120 seconds
Num Read Ops = 1.2M
Throughput = 10K per second

Test #4

I re-enable the block cache and this time hack HBase to cache only index and
bloom blocks, while data blocks come from the file system cache.

Test Time = 120 seconds
Num Read Ops = 1.6M
Throughput = 13K per second

So, I wonder why there is such a massive drop in throughput. I know that the
HDFS code adds tremendous overhead, but this seems pretty high to me. I use
0.94.7 and CDH 4.2.0.

Thanks
Varun


RE: Poor HBase random read performance

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
I would like to remind everyone that the original BigTable design includes a scan cache to take care of random
reads, and this important feature is still missing in HBase.
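
To make the distinction concrete: a scan cache holds decoded rows/KVs, while the HBase block cache holds whole HFile blocks. Below is a very rough, purely illustrative sketch of a client-side row cache built with Guava (the table name and row key are placeholders); a real server-side scan cache would also have to handle invalidation on writes, which this ignores:

    import java.io.IOException;
    import java.util.concurrent.ExecutionException;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowCacheSketch {
      // Caches decoded Results by row key, which is roughly what a scan cache
      // holds, instead of the 64k HFile blocks the block cache holds.
      private final LoadingCache<String, Result> rowCache;

      public RowCacheSketch(final HTable table) {
        this.rowCache = CacheBuilder.newBuilder()
            .maximumSize(1000000)                    // bound by entry count for simplicity
            .build(new CacheLoader<String, Result>() {
              @Override
              public Result load(String row) throws IOException {
                return table.get(new Get(Bytes.toBytes(row)));  // miss: one HBase get
              }
            });
      }

      public Result get(String row) throws ExecutionException {
        return rowCache.get(row);                    // hit: served from memory
      }

      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "test_table"); // placeholder
        RowCacheSketch cache = new RowCacheSketch(table);
        System.out.println(cache.get("row-12345"));  // placeholder row key
        table.close();
      }
    }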

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: lars hofhansl [larsh@apache.org]
Sent: Saturday, June 29, 2013 3:24 PM
To: dev@hbase.apache.org
Subject: Re: Poor HBase random read performance

Should also say that random reads this way are somewhat of a worst case scenario.

If the working set is much larger than the block cache and the reads are random, then each read will likely have to bring in an entirely new block from the OS cache,
even when the KVs are much smaller than a block.

So in order to read a (say) 1k KV HBase needs to bring 64k (default block size) from the OS cache.
As long as the dataset fits into the block cache this difference in size has no performance impact, but as soon as the dataset does not fit, we have to bring much more data from the OS cache than we're actually interested in.

Indeed, in my test I found that HBase brings in about 60x the data size from the OS cache (using PE with ~1k KVs). This can be improved with smaller block sizes and with a more efficient way to instantiate HFile blocks in Java (which we need to work on).


-- Lars

________________________________
From: lars hofhansl <la...@apache.org>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Saturday, June 29, 2013 3:09 PM
Subject: Re: Poor HBase random read performance


I've seen the same bad performance behavior when I tested this on a real cluster. (I think it was in 0.94.6)


Instead of en/disabling the blockcache, I tested sequential and random reads on a data set that does not fit into the (aggregate) block cache.
Sequential reads were drastically faster than random reads (7 vs. 34 minutes), which can really only be explained by the fact that the next get will, with high probability, hit an already cached block, whereas in the random read case it likely will not.

In the RandomRead case I estimate that each RegionServer brings in between 100 and 200 MB/s from the OS cache. Even at 200 MB/s this would be quite slow. I understand that performance is bad when index/bloom blocks are not cached, but bringing in data blocks from the OS cache should be faster than it is.


So this is something to debug.

-- Lars



________________________________
From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Sent: Saturday, June 29, 2013 12:13 PM
Subject: Poor HBase random read performance


Hi,

I was doing some tests on how good HBase random reads are. The setup
consists of a 1-node cluster with dfs replication set to 1. Short-circuit
local reads and HBase checksums are enabled. The data set is small enough
to be largely cached in the filesystem cache - 10G on a 60G machine.

The client sends out multi-get operations in batches of 10, and I try to measure
throughput.

Test #1

All Data was cached in the block cache.

Test Time = 120 seconds
Num Read Ops = 12M

Throughput = 100K per second

Test #2

I disable block cache. But now all the data is in the file system cache. I
verify this by making sure that IOPs on the disk drive are 0 during the
test. I run the same test with batched ops.

Test Time = 120 seconds
Num Read Ops = 0.6M
Throughput = 5K per second

Test #3

I saw all the threads are now stuck in idLock.lockEntry(). So I now run
with the lock disabled and the block cache disabled.

Test Time = 120 seconds
Num Read Ops = 1.2M
Throughput = 10K per second

Test #4

I re-enable the block cache and this time hack HBase to cache only index and
bloom blocks, while data blocks come from the file system cache.

Test Time = 120 seconds
Num Read Ops = 1.6M
Throughput = 13K per second

So, I wonder why there is such a massive drop in throughput. I know that the
HDFS code adds tremendous overhead, but this seems pretty high to me. I use
0.94.7 and CDH 4.2.0.

Thanks
Varun


Re: Poor HBase random read performance

Posted by lars hofhansl <la...@apache.org>.
Should also say that random reads this way are somewhat of a worst case scenario.

If the working set is much larger than the block cache and the reads are random, then each read will likely have to bring in an entirely new block from the OS cache,
even when the KVs are much smaller than a block.

So in order to read a (say) 1k KV HBase needs to bring 64k (default block size) from the OS cache.
As long as the dataset fits into the block cache this difference in size has no performance impact, but as soon as the dataset does not fit, we have to bring much more data from the OS cache than we're actually interested in.

Indeed, in my test I found that HBase brings in about 60x the data size from the OS cache (using PE with ~1k KVs). This can be improved with smaller block sizes and with a more efficient way to instantiate HFile blocks in Java (which we need to work on).
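
For anyone who wants to try the smaller-block-size side of this: block size is a per-column-family setting, and a sketch with the 0.94 admin API follows, using placeholder table/family names and an arbitrary 8k value. The trade-off is a larger block index, so the index blocks really want to stay in the block cache.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlockSizeSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          byte[] tableName = Bytes.toBytes("test_table");   // placeholder
          byte[] familyName = Bytes.toBytes("f");           // placeholder
          // Fetch the existing family definition so only the block size changes.
          HTableDescriptor htd = admin.getTableDescriptor(tableName);
          HColumnDescriptor family = htd.getFamily(familyName);
          family.setBlocksize(8 * 1024);   // 8k data blocks instead of the default 64k
          admin.disableTable(tableName);
          admin.modifyColumn(tableName, family);
          admin.enableTable(tableName);
          // The new block size only applies to newly written HFiles, so rewrite them.
          admin.majorCompact(tableName);
        } finally {
          admin.close();
        }
      }
    }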


-- Lars

________________________________
From: lars hofhansl <la...@apache.org>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Sent: Saturday, June 29, 2013 3:09 PM
Subject: Re: Poor HBase random read performance


I've seen the same bad performance behavior when I tested this on a real cluster. (I think it was in 0.94.6)


Instead of en/disabling the blockcache, I tested sequential and random reads on a data set that does not fit into the (aggregate) block cache.
Sequential reads were drastically faster than random reads (7 vs. 34 minutes), which can really only be explained by the fact that the next get will, with high probability, hit an already cached block, whereas in the random read case it likely will not.

In the RandomRead case I estimate that each RegionServer brings in between 100 and 200 MB/s from the OS cache. Even at 200 MB/s this would be quite slow. I understand that performance is bad when index/bloom blocks are not cached, but bringing in data blocks from the OS cache should be faster than it is.


So this is something to debug.

-- Lars



________________________________
From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Sent: Saturday, June 29, 2013 12:13 PM
Subject: Poor HBase random read performance


Hi,

I was doing some tests on how good HBase random reads are. The setup
consists of a 1-node cluster with dfs replication set to 1. Short-circuit
local reads and HBase checksums are enabled. The data set is small enough
to be largely cached in the filesystem cache - 10G on a 60G machine.

The client sends out multi-get operations in batches of 10, and I try to measure
throughput.

Test #1

All Data was cached in the block cache.

Test Time = 120 seconds
Num Read Ops = 12M

Throughput = 100K per second

Test #2

I disable block cache. But now all the data is in the file system cache. I
verify this by making sure that IOPs on the disk drive are 0 during the
test. I run the same test with batched ops.

Test Time = 120 seconds
Num Read Ops = 0.6M
Throughput = 5K per second

Test #3

I saw all the threads are now stuck in idLock.lockEntry(). So I now run
with the lock disabled and the block cache disabled.

Test Time = 120 seconds
Num Read Ops = 1.2M
Throughput = 10K per second

Test #4

I re-enable the block cache and this time hack HBase to cache only index and
bloom blocks, while data blocks come from the file system cache.

Test Time = 120 seconds
Num Read Ops = 1.6M
Throughput = 13K per second

So, I wonder why there is such a massive drop in throughput. I know that the
HDFS code adds tremendous overhead, but this seems pretty high to me. I use
0.94.7 and CDH 4.2.0.

Thanks
Varun

Re: Poor HBase random read performance

Posted by lars hofhansl <la...@apache.org>.
I've seen the same bad performance behavior when I tested this on a real cluster. (I think it was in 0.94.6)


Instead of en/disabling the blockcache, I tested sequential and random reads on a data set that does not fit into the (aggregate) block cache.
Sequential reads were drastically faster than random reads (7 vs. 34 minutes), which can really only be explained by the fact that the next get will, with high probability, hit an already cached block, whereas in the random read case it likely will not.

In the RandomRead case I estimate that each RegionServer brings in between 100 and 200 MB/s from the OS cache. Even at 200 MB/s this would be quite slow. I understand that performance is bad when index/bloom blocks are not cached, but bringing in data blocks from the OS cache should be faster than it is.


So this is something to debug.
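
For reproducing a comparison of this shape, a minimal sketch with the 0.94 client API is below; the table name and row-key format are assumptions, and it only illustrates the two access patterns, not the exact test above:

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SeqVsRandomReadSketch {
      private static final int NUM_READS = 100000;     // sample size, not the run above
      private static final int KEY_SPACE = 10000000;   // assumed number of rows

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table"); // placeholder table name

        // Sequential: consecutive rows share HFile blocks, so most next() calls
        // hit a block that the previous call already pulled in.
        long start = System.currentTimeMillis();
        Scan scan = new Scan();
        scan.setCaching(1000);                         // fetch rows in large RPC batches
        ResultScanner scanner = table.getScanner(scan);
        int seen = 0;
        for (Result r = scanner.next(); r != null && seen < NUM_READS; r = scanner.next()) {
          seen++;
        }
        scanner.close();
        System.out.println("sequential: " + (System.currentTimeMillis() - start) + " ms");

        // Random: each Get very likely lands on a block that is not cached yet,
        // so a whole block is brought in from the OS cache for one small KV.
        start = System.currentTimeMillis();
        Random rand = new Random();
        for (int i = 0; i < NUM_READS; i++) {
          table.get(new Get(Bytes.toBytes("row-" + rand.nextInt(KEY_SPACE))));
        }
        System.out.println("random: " + (System.currentTimeMillis() - start) + " ms");

        table.close();
      }
    }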

-- Lars



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Sent: Saturday, June 29, 2013 12:13 PM
Subject: Poor HBase random read performance
 

Hi,

I was doing some tests on how good HBase random reads are. The setup
consists of a 1-node cluster with dfs replication set to 1. Short-circuit
local reads and HBase checksums are enabled. The data set is small enough
to be largely cached in the filesystem cache - 10G on a 60G machine.

The client sends out multi-get operations in batches of 10, and I try to measure
throughput.

Test #1

All Data was cached in the block cache.

Test Time = 120 seconds
Num Read Ops = 12M

Throughput = 100K per second

Test #2

I disable block cache. But now all the data is in the file system cache. I
verify this by making sure that IOPs on the disk drive are 0 during the
test. I run the same test with batched ops.

Test Time = 120 seconds
Num Read Ops = 0.6M
Throughput = 5K per second

Test #3

I saw all the threads are now stuck in idLock.lockEntry(). So I now run
with the lock disabled and the block cache disabled.

Test Time = 120 seconds
Num Read Ops = 1.2M
Throughput = 10K per second

Test #4

I re-enable the block cache and this time hack HBase to cache only index and
bloom blocks, while data blocks come from the file system cache.

Test Time = 120 seconds
Num Read Ops = 1.6M
Throughput = 13K per second

So, I wonder why there is such a massive drop in throughput. I know that the
HDFS code adds tremendous overhead, but this seems pretty high to me. I use
0.94.7 and CDH 4.2.0.

Thanks
Varun