Posted to user@hbase.apache.org by Bryan Keller <br...@gmail.com> on 2013/05/01 06:01:57 UTC

Poor HBase map-reduce scan performance

I have been attempting to speed up my HBase map-reduce scans for a while now. I have tried just about everything without much luck. I'm running out of ideas and was hoping for some suggestions. This is HBase 0.94.2 and Hadoop 2.0.0 (CDH4.2.1).

The table I'm scanning:
20 mil rows
Hundreds of columns/row
Column keys can be 30-40 bytes
Column values are generally not large, 1k would be on the large side
250 regions
Snappy compression
8gb region size
512mb memstore flush
128k block size
700gb of data on HDFS

My cluster has 8 datanodes which are also regionservers. Each has 8 cores (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate machine acting as namenode, HMaster, and zookeeper (single instance). I have disk local reads turned on.

I'm seeing around 5 gbit/sec on average network IO. Each disk is getting 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.

Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not really that great compared to the theoretical I/O. However this is far better than I am seeing with HBase map-reduce scans of my table.

I have a simple no-op map-only job (using TableInputFormat) that scans the table and does nothing with the data. This takes 45 minutes. That's about 260mb/sec read speed, over 5x slower than straight HDFS. Basically, with HBase the read performance of my 16-SSD cluster is nearly 35% slower than a single SSD.
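
For reference, a minimal sketch of a no-op map-only scan job of this kind; the
table name ("mytable") and the scan settings below are placeholders rather than
the actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoOpScan {

  // Mapper that receives each row's Result and discards it, so the job
  // measures nothing but raw scan throughput.
  static class NoOpMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter("scan", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "noop-scan");
    job.setJarByClass(NoOpScan.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per RPC (one of the values tried)
    scan.setCacheBlocks(false);  // don't churn the block cache during a full scan

    TableMapReduceUtil.initTableMapperJob("mytable", scan, NoOpMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);                           // map-only
    job.setOutputFormatClass(NullOutputFormat.class);   // nothing is written out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}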

Here are some things I have changed to no avail:
Scan caching values
HDFS block sizes
HBase block sizes
Region file sizes
Memory settings
GC settings
Number of mappers/node
Compressed vs not compressed

One thing I notice is that the regionserver is using quite a bit of CPU during the map-reduce job. When I dump the jstack of the process, it is usually in some kind of memory allocation or decompression routine, which doesn't seem abnormal.

I can't seem to pinpoint the bottleneck. CPU use by the regionserver is high but not maxed out. Disk I/O and network I/O are low, IO wait is low. I'm on the verge of just writing the dataset out to sequence files once a day for scan purposes. Is that what others are doing?
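
(A sketch of the sequence-file idea, for what it's worth: the job setup differs
from the no-op sketch above only in the mapper and the output format, and is
roughly what the bundled org.apache.hadoop.hbase.mapreduce.Export tool does.
The table name and output path are placeholders.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Reuses the Scan and Job set up as in the no-op sketch above.
// IdentityTableMapper simply re-emits each (row key, Result) pair, and the
// pairs are written to SequenceFiles that later scan jobs can read straight
// from HDFS without going through the regionservers.
TableMapReduceUtil.initTableMapperJob("mytable", scan,
    IdentityTableMapper.class, ImmutableBytesWritable.class, Result.class, job);
job.setNumReduceTasks(0);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(job, new Path("/exports/mytable/current"));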

Re: Poor HBase map-reduce scan performance

Posted by Michael Segel <mi...@hotmail.com>.
In terms of redesigning your schema, I'd go with Avro over protobufs.

With respect to CPUs, you don't say what your system looks like: Intel vs. AMD, number of physical cores, what else you're running on the machine (number of mapper/reducer slots), etc.

In terms of the schema... 

How are you accessing your data? 
You said that you want to filter on a column value... if you use Avro to store the address record as, let's say, a JSON string... write a custom filter?
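
(For what it's worth, the single-column filtering being given up looks roughly
like this with the stock filters; the family, qualifier, and value below are
made-up placeholders. With a packed Avro or protobuf cell, a custom FilterBase
subclass would instead have to deserialize the blob server-side to do the same
comparison.)

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Filter on one field while it is still stored as its own column:
Scan scan = new Scan();
scan.setFilter(new SingleColumnValueFilter(
    Bytes.toBytes("cf"),              // column family (placeholder)
    Bytes.toBytes("1_city"),          // one of the prefixed qualifiers
    CompareFilter.CompareOp.EQUAL,
    Bytes.toBytes("Portland")));      // value to match (placeholder)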

And people laughed at me when I said that schema design was critical and often misunderstood. ;-) 
(Ok the truth was that they laughed at me because I thought I looked cool wearing a plaid suit.) 

HTH


On May 1, 2013, at 1:02 AM, Bryan Keller <br...@gmail.com> wrote:

> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
> 
> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
> 
> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
> 
> I'll also come up with a sample program that generates data similar to my table.
> 
> 
> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>> 
>> I assume your table is split sufficiently to touch all RegionServers... Do you see the same load/IO on all region servers?
>> 
>> A bunch of scan improvements went into HBase since 0.94.2.
>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>> 
>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>> 
>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>> So I would start by looking at HDFS first. Make sure Nagle's is disabled in both HBase and HDFS.
>> 
>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>> 
>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>> 
>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>> 
>> -- Lars
>> 
>> 
>> 
>> 
>> ________________________________
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org 
>> Sent: Tuesday, April 30, 2013 9:31 PM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>> 
>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>> 
>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>> 
>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>> be bad for MapReduce jobs
>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>> 
>>> I guess you have used the above setting.
>>> 
>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>> 0.94.7 which was recently released ?
>>> 
>>> Cheers
>>> 
>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com> wrote:
>>> 
>>>> I have been attempting to speed up my HBase map-reduce scans for a while
>>>> now. I have tried just about everything without much luck. I'm running out
>>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>>>> Hadoop 2.0.0 (CDH4.2.1).
>>>> 
>>>> The table I'm scanning:
>>>> 20 mil rows
>>>> Hundreds of columns/row
>>>> Column keys can be 30-40 bytes
>>>> Column values are generally not large, 1k would be on the large side
>>>> 250 regions
>>>> Snappy compression
>>>> 8gb region size
>>>> 512mb memstore flush
>>>> 128k block size
>>>> 700gb of data on HDFS
>>>> 
>>>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>>>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>>>> have disk local reads turned on.
>>>> 
>>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
>>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>>>> 
>>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>>>> really that great compared to the theoretical I/O. However this is far
>>>> better than I am seeing with HBase map-reduce scans of my table.
>>>> 
>>>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>>>> table and does nothing with data. This takes 45 minutes. That's about
>>>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>>>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster
>>>> performing nearly 35% slower than a single SSD.
>>>> 
>>>> Here are some things I have changed to no avail:
>>>> Scan caching values
>>>> HDFS block sizes
>>>> HBase block sizes
>>>> Region file sizes
>>>> Memory settings
>>>> GC settings
>>>> Number of mappers/node
>>>> Compressed vs not compressed
>>>> 
>>>> One thing I notice is that the regionserver is using quite a bit of CPU
>>>> during the map reduce job. When dumping the jstack of the process, it seems
>>>> like it is usually in some type of memory allocation or decompression
>>>> routine which didn't seem abnormal.
>>>> 
>>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
>>>> I'm on the verge of just writing the dataset out to sequence files once a
>>>> day for scan purposes. Is that what others are doing?
> 
> 


Re: Poor HBase map-reduce scan performance

Posted by Nicolas Liochon <nk...@gmail.com>.
You can try YourKit; they have evaluation licenses. There is one gotcha:
some classes are excluded by default, and this includes org.apache.*, so
you need to change the default config when using it with HBase.


On Thu, May 2, 2013 at 7:54 PM, Bryan Keller <br...@gmail.com> wrote:

> I ran one of my regionservers through VisualVM. It looks like the top hot
> spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate().
> It appears at first glance that memory allocations may be an issue.
> Decompression was next below that but less of an issue it seems.
>
> Would changing the block size, either HDFS or HBase, help here?
>
> Also, if anyone has tips on how else to profile, that would be
> appreciated. VisualVM can produce a lot of noise that is hard to sift
> through.
>
>
> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>
> > I used exactly 0.94.4, pulled from the tag in subversion.
> >
> > On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> Hmm... Did you actually use exactly version 0.94.4, or the latest
> 0.94.7.
> >> I would be very curious to see profiling data.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Bryan Keller <br...@gmail.com>
> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> Cc:
> >> Sent: Wednesday, May 1, 2013 6:01 PM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >> I tried running my test with 0.94.4, unfortunately performance was
> about the same. I'm planning on profiling the regionserver and trying some
> other things tonight and tomorrow and will report back.
> >>
> >> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
> >>
> >>> Yes I would like to try this, if you can point me to the pom.xml patch
> that would save me some time.
> >>>
> >>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >>> If you can, try 0.94.4+; it should significantly reduce the amount of
> bytes copied around in RAM during scanning, especially if you have wide
> rows and/or large key portions. That in turns makes scans scale better
> across cores, since RAM is shared resource between cores (much like disk).
> >>>
> >>>
> >>> It's not hard to build the latest HBase against Cloudera's version of
> Hadoop. I can send along a simple patch to pom.xml to do that.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>>  From: Bryan Keller <br...@gmail.com>
> >>> To: user@hbase.apache.org
> >>> Sent: Tuesday, April 30, 2013 11:02 PM
> >>> Subject: Re: Poor HBase map-reduce scan performance
> >>>
> >>>
> >>> The table has hashed keys so rows are evenly distributed amongst the
> regionservers, and load on each regionserver is pretty much the same. I
> also have per-table balancing turned on. I get mostly data local mappers
> with only a few rack local (maybe 10 of the 250 mappers).
> >>>
> >>> Currently the table is a wide table schema, with lists of data
> structures stored as columns with column prefixes grouping the data
> structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I
> was thinking of moving those data structures to protobuf which would cut
> down on the number of columns. The downside is I can't filter on one value
> with that, but it is a tradeoff I would make for performance. I was also
> considering restructuring the table into a tall table.
> >>>
> >>> Something interesting is that my old regionserver machines had five
> 15k SCSI drives instead of 2 SSDs, and performance was about the same.
> Also, my old network was 1gbit, now it is 10gbit. So neither network nor
> disk I/O appear to be the bottleneck. The CPU is rather high for the
> regionserver so it seems like the best candidate to investigate. I will try
> profiling it tomorrow and will report back. I may revisit compression on vs
> off since that is adding load to the CPU.
> >>>
> >>> I'll also come up with a sample program that generates data similar to
> my table.
> >>>
> >>>
> >>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> >>>
> >>>> Your average row is 35k so scanner caching would not make a huge
> difference, although I would have expected some improvements by setting it
> to 10 or 50 since you have a wide 10ge pipe.
> >>>>
> >>>> I assume your table is split sufficiently to touch all
> RegionServer... Do you see the same load/IO on all region servers?
> >>>>
> >>>> A bunch of scan improvements went into HBase since 0.94.2.
> >>>> I blogged about some of these changes here:
> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >>>>
> >>>> In your case - since you have many columns, each of which carry the
> rowkey - you might benefit a lot from HBASE-7279.
> >>>>
> >>>> In the end HBase *is* slower than straight HDFS for full scans. How
> could it not be?
> >>>> So I would start by looking at HDFS first. Make sure Nagle's is
> disabled in both HBase and HDFS.
> >>>>
> >>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> Purtell is listening, I think he did some tests with HBase on SSDs.
> >>>> With rotating media you typically see an improvement with
> compression. With SSDs the added CPU needed for decompression might
> outweigh the benefits.
> >>>>
> >>>> At the risk of starting a larger discussion here, I would posit that
> HBase's LSM based design, which trades random IO with sequential IO, might
> be a bit more questionable on SSDs.
> >>>>
> >>>> If you can, it would be nice to run a profiler against one of the
> RegionServers (or maybe do it with the single RS setup) and see where it is
> bottlenecked.
> >>>> (And if you send me a sample program to generate some data - not
> 700g, though :) - I'll try to do a bit of profiling during the next days as
> my day job permits, but I do not have any machines with SSDs).
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Bryan Keller <br...@gmail.com>
> >>>> To: user@hbase.apache.org
> >>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>
> >>>>
> >>>> Yes, I have tried various settings for setCaching() and I have
> setCacheBlocks(false)
> >>>>
> >>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>>
> >>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >>>>>
> >>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
> >>>>> be bad for MapReduce jobs
> >>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>>>>
> >>>>> I guess you have used the above setting.
> >>>>>
> >>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
> >>>>> 0.94.7 which was recently released ?
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> >>
> >
>
>

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I attached my patch to the JIRA issue, in case anyone is interested. It can pretty easily be used on its own without patching HBase. I am currently doing this.


On Jul 1, 2013, at 2:23 PM, Enis Söztutar <en...@gmail.com> wrote:

> Bryan,
> 
> 3.6x improvement seems exciting. The ballpark difference between HBase scan
> and hdfs scan is in that order, so it is expected I guess.
> 
> I plan to get back to the trunk patch, add more tests etc next week. In the
> mean time, if you have any changes to the patch, pls attach the patch.
> 
> Enis
> 
> 
> On Mon, Jul 1, 2013 at 3:59 AM, lars hofhansl <la...@apache.org> wrote:
> 
>> Absolutely.
>> 
>> 
>> 
>> ----- Original Message -----
>> From: Ted Yu <yu...@gmail.com>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Sunday, June 30, 2013 9:32 PM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> Looking at the tail of HBASE-8369, there were some comments which are yet
>> to be addressed.
>> 
>> I think trunk patch should be finalized before backporting.
>> 
>> Cheers
>> 
>> On Mon, Jul 1, 2013 at 12:23 PM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I'll attach my patch to HBASE-8369 tomorrow.
>>> 
>>> On Jun 28, 2013, at 10:56 AM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> If we can make a clean patch with minimal impact to existing code I
>>> would be supportive of a backport to 0.94.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>>>> Cc:
>>>> Sent: Tuesday, June 25, 2013 1:56 AM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> I tweaked Enis's snapshot input format and backported it to 0.94.6 and
>>> have snapshot scanning functional on my system. Performance is
>> dramatically
>>> better, as expected i suppose. I'm seeing about 3.6x faster performance
>> vs
>>> TableInputFormat. Also, HBase doesn't get bogged down during a scan as
>> the
>>> regionserver is being bypassed. I'm very excited by this. There are some
>>> issues with file permissions and library dependencies but nothing that
>>> can't be worked out.
>>>> 
>>>> On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:
>>>> 
>>>>> That's exactly the kind of pre-fetching I was investigating a bit ago
>>> (made a patch, but ran out of time).
>>>>> This pre-fetching is strictly client only, where the client keeps the
>>> server busy while it is processing the previous batch, but filling up a
>> 2nd
>>> buffer.
>>>>> 
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Sandy Pratt <pr...@adobe.com>
>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>> Sent: Wednesday, June 5, 2013 10:58 AM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>> 
>>>>> 
>>>>> Yong,
>>>>> 
>>>>> As a thought experiment, imagine how it impacts the throughput of TCP
>> to
>>>>> keep the window size at 1.  That means there's only one packet in
>> flight
>>>>> at a time, and total throughput is a fraction of what it could be.
>>>>> 
>>>>> That's effectively what happens with RPC.  The server sends a batch,
>>> then
>>>>> does nothing while it waits for the client to ask for more.  During
>> that
>>>>> time, the pipe between them is empty.  Increasing the batch size can
>>> help
>>>>> a bit, in essence creating a really huge packet, but the problem
>>> remains.
>>>>> There will always be stalls in the pipe.
>>>>> 
>>>>> What you want is for the window size to be large enough that the pipe
>> is
>>>>> saturated.  A streaming API accomplishes that by stuffing data down
>> the
>>>>> network pipe as quickly as possible.
>>>>> 
>>>>> Sandy
>>>>> 
>>>>> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
>>>>> 
>>>>>> Can anyone explain why client + rpc + server will decrease the
>>> performance
>>>>>> of scanning? I mean the Regionserver and Tasktracker are the same
>> node
>>>>>> when
>>>>>> you use MapReduce to scan the HBase table. So, in my understanding,
>>> there
>>>>>> will be no rpc cost.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Yong
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com>
>>> wrote:
>>>>>> 
>>>>>>> https://issues.apache.org/jira/browse/HBASE-8691
>>>>>>> 
>>>>>>> 
>>>>>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>>>>>>> 
>>>>>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in
>>> here
>>>>>>>> with an update in the meantime.
>>>>>>>> 
>>>>>>>> I tried a number of different approaches to eliminate latency and
>>>>>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
>>>>>>>> streaming scan API to the region server, along with refactoring the
>>>>>>> scan
>>>>>>>> interface into an event-driven message receiver interface.  In so
>>>>>>> doing, I
>>>>>>>> was able to take scan speed on my cluster from 59,537 records/sec
>>> with
>>>>>>> the
>>>>>>>> classic scanner to 222,703 records per second with my new scan API.
>>>>>>>> Needless to say, I'm pleased ;)
>>>>>>>> 
>>>>>>>> More details forthcoming when I get a chance.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Sandy
>>>>>>>> 
>>>>>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Thanks for the update, Sandy.
>>>>>>>>> 
>>>>>>>>> If you can open a JIRA and attach your producer / consumer scanner
>>>>>>> there,
>>>>>>>>> that would be great.
>>>>>>>>> 
>>>>>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
>>>>>>> queue to
>>>>>>>>>> keep the client fed with a full buffer as much as possible.  When
>>>>>>>>>> scanning
>>>>>>>>>> my table with scanner caching at 100 records, I see about a 24%
>>>>>>> uplift
>>>>>>>>>> in
>>>>>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
>>>>>>>>>> records/sec
>>>>>>>>>> with my P/C scanner).  However, when I set scanner caching to
>> 5000,
>>>>>>>>>> it's
>>>>>>>>>> more of a wash compared to the standard ClientScanner: ~53k
>>>>>>> records/sec
>>>>>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>>>>>>>>> 
>>>>>>>>>> I'm not sure what to make of those results.  I think next I'll
>> shut
>>>>>>>>>> down
>>>>>>>>>> HBase and read the HFiles directly, to see if there's a drop off
>> in
>>>>>>>>>> performance between reading them directly vs. via the
>> RegionServer.
>>>>>>>>>> 
>>>>>>>>>> I still think that to really solve this there needs to be a sliding
>>>>>>>>>> window
>>>>>>>>>> of records in flight between disk and RS, and between RS and
>>> client.
>>>>>>>>>> I'm
>>>>>>>>>> thinking there's probably a single batch of records in flight
>>>>>>> between
>>>>>>>>>> RS
>>>>>>>>>> and client at the moment.
>>>>>>>>>> 
>>>>>>>>>> Sandy
>>>>>>>>>> 
>>>>>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I am considering scanning a snapshot instead of the table. I
>>>>>>> believe
>>>>>>>>>> this
>>>>>>>>>>> is what the ExportSnapshot class does. If I could use the
>> scanning
>>>>>>>>>> code
>>>>>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
>>>>>>>>>> directly
>>>>>>>>>>> and bypass the regionservers. This could potentially give me a
>>> huge
>>>>>>>>>> boost
>>>>>>>>>>> in performance for full table scans. However, it doesn't really
>>>>>>>>>> address
>>>>>>>>>>> the poor scan performance against a table.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 


Re: Poor HBase map-reduce scan performance

Posted by Enis Söztutar <en...@gmail.com>.
Bryan,

3.6x improvement seems exciting. The ballpark difference between an HBase
scan and an HDFS scan is in that order, so I guess it is expected.

I plan to get back to the trunk patch and add more tests, etc. next week. In
the meantime, if you have any changes to the patch, please attach them.

Enis


On Mon, Jul 1, 2013 at 3:59 AM, lars hofhansl <la...@apache.org> wrote:

> Absolutely.
>
>
>
> ----- Original Message -----
> From: Ted Yu <yu...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Sunday, June 30, 2013 9:32 PM
> Subject: Re: Poor HBase map-reduce scan performance
>
> Looking at the tail of HBASE-8369, there were some comments which are yet
> to be addressed.
>
> I think trunk patch should be finalized before backporting.
>
> Cheers
>
> On Mon, Jul 1, 2013 at 12:23 PM, Bryan Keller <br...@gmail.com> wrote:
>
> > I'll attach my patch to HBASE-8369 tomorrow.
> >
> > On Jun 28, 2013, at 10:56 AM, lars hofhansl <la...@apache.org> wrote:
> >
> > > If we can make a clean patch with minimal impact to existing code I
> > would be supportive of a backport to 0.94.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: Bryan Keller <br...@gmail.com>
> > > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > > Cc:
> > > Sent: Tuesday, June 25, 2013 1:56 AM
> > > Subject: Re: Poor HBase map-reduce scan performance
> > >
> > > I tweaked Enis's snapshot input format and backported it to 0.94.6 and
> > have snapshot scanning functional on my system. Performance is
> dramatically
> > better, as expected i suppose. I'm seeing about 3.6x faster performance
> vs
> > TableInputFormat. Also, HBase doesn't get bogged down during a scan as
> the
> > regionserver is being bypassed. I'm very excited by this. There are some
> > issues with file permissions and library dependencies but nothing that
> > can't be worked out.
> > >
> > > On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:
> > >
> > >> That's exactly the kind of pre-fetching I was investigating a bit ago
> > (made a patch, but ran out of time).
> > >> This pre-fetching is strictly client only, where the client keeps the
> > server busy while it is processing the previous batch, but filling up a
> 2nd
> > buffer.
> > >>
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
> > >> ________________________________
> > >> From: Sandy Pratt <pr...@adobe.com>
> > >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > >> Sent: Wednesday, June 5, 2013 10:58 AM
> > >> Subject: Re: Poor HBase map-reduce scan performance
> > >>
> > >>
> > >> Yong,
> > >>
> > >> As a thought experiment, imagine how it impacts the throughput of TCP
> to
> > >> keep the window size at 1.  That means there's only one packet in
> flight
> > >> at a time, and total throughput is a fraction of what it could be.
> > >>
> > >> That's effectively what happens with RPC.  The server sends a batch,
> > then
> > >> does nothing while it waits for the client to ask for more.  During
> that
> > >> time, the pipe between them is empty.  Increasing the batch size can
> > help
> > >> a bit, in essence creating a really huge packet, but the problem
> > remains.
> > >> There will always be stalls in the pipe.
> > >>
> > >> What you want is for the window size to be large enough that the pipe
> is
> > >> saturated.  A streaming API accomplishes that by stuffing data down
> the
> > >> network pipe as quickly as possible.
> > >>
> > >> Sandy
> > >>
> > >> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
> > >>
> > >>> Can anyone explain why client + rpc + server will decrease the
> > performance
> > >>> of scanning? I mean the Regionserver and Tasktracker are the same
> node
> > >>> when
> > >>> you use MapReduce to scan the HBase table. So, in my understanding,
> > there
> > >>> will be no rpc cost.
> > >>>
> > >>> Thanks!
> > >>>
> > >>> Yong
> > >>>
> > >>>
> > >>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com>
> > wrote:
> > >>>
> > >>>> https://issues.apache.org/jira/browse/HBASE-8691
> > >>>>
> > >>>>
> > >>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
> > >>>>
> > >>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in
> > here
> > >>>>> with an update in the meantime.
> > >>>>>
> > >>>>> I tried a number of different approaches to eliminate latency and
> > >>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
> > >>>>> streaming scan API to the region server, along with refactoring the
> > >>>> scan
> > >>>>> interface into an event-drive message receiver interface.  In so
> > >>>> doing, I
> > >>>>> was able to take scan speed on my cluster from 59,537 records/sec
> > with
> > >>>> the
> > >>>>> classic scanner to 222,703 records per second with my new scan API.
> > >>>>> Needless to say, I'm pleased ;)
> > >>>>>
> > >>>>> More details forthcoming when I get a chance.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Sandy
> > >>>>>
> > >>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Thanks for the update, Sandy.
> > >>>>>>
> > >>>>>> If you can open a JIRA and attach your producer / consumer scanner
> > >>>> there,
> > >>>>>> that would be great.
> > >>>>>>
> > >>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
> > >>>> queue to
> > >>>>>>> keep the client fed with a full buffer as much as possible.  When
> > >>>>>>> scanning
> > >>>>>>> my table with scanner caching at 100 records, I see about a 24%
> > >>>> uplift
> > >>>>>>> in
> > >>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
> > >>>>>>> records/sec
> > >>>>>>> with my P/C scanner).  However, when I set scanner caching to
> 5000,
> > >>>>>>> it's
> > >>>>>>> more of a wash compared to the standard ClientScanner: ~53k
> > >>>> records/sec
> > >>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > >>>>>>>
> > >>>>>>> I'm not sure what to make of those results.  I think next I'll
> shut
> > >>>>>>> down
> > >>>>>>> HBase and read the HFiles directly, to see if there's a drop off
> in
> > >>>>>>> performance between reading them directly vs. via the
> RegionServer.
> > >>>>>>>
> > >>>>>>> I still think that to really solve this there needs to be sliding
> > >>>>>>> window
> > >>>>>>> of records in flight between disk and RS, and between RS and
> > client.
> > >>>>>>> I'm
> > >>>>>>> thinking there's probably a single batch of records in flight
> > >>>> between
> > >>>>>>> RS
> > >>>>>>> and client at the moment.
> > >>>>>>>
> > >>>>>>> Sandy
> > >>>>>>>
> > >>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> I am considering scanning a snapshot instead of the table. I
> > >>>> believe
> > >>>>>>> this
> > >>>>>>>> is what the ExportSnapshot class does. If I could use the
> scanning
> > >>>>>>> code
> > >>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
> > >>>>>>> directly
> > >>>>>>>> and bypass the regionservers. This could potentially give me a
> > huge
> > >>>>>>> boost
> > >>>>>>>> in performance for full table scans. However, it doesn't really
> > >>>>>>> address
> > >>>>>>>> the poor scan performance against a table.
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >
> >
> >
>
>

Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
Absolutely.



----- Original Message -----
From: Ted Yu <yu...@gmail.com>
To: user@hbase.apache.org
Cc: 
Sent: Sunday, June 30, 2013 9:32 PM
Subject: Re: Poor HBase map-reduce scan performance

Looking at the tail of HBASE-8369, there were some comments which are yet
to be addressed.

I think trunk patch should be finalized before backporting.

Cheers

On Mon, Jul 1, 2013 at 12:23 PM, Bryan Keller <br...@gmail.com> wrote:

> I'll attach my patch to HBASE-8369 tomorrow.
>
> On Jun 28, 2013, at 10:56 AM, lars hofhansl <la...@apache.org> wrote:
>
> > If we can make a clean patch with minimal impact to existing code I
> would be supportive of a backport to 0.94.
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > Cc:
> > Sent: Tuesday, June 25, 2013 1:56 AM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> > I tweaked Enis's snapshot input format and backported it to 0.94.6 and
> have snapshot scanning functional on my system. Performance is dramatically
> better, as expected i suppose. I'm seeing about 3.6x faster performance vs
> TableInputFormat. Also, HBase doesn't get bogged down during a scan as the
> regionserver is being bypassed. I'm very excited by this. There are some
> issues with file permissions and library dependencies but nothing that
> can't be worked out.
> >
> > On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> That's exactly the kind of pre-fetching I was investigating a bit ago
> (made a patch, but ran out of time).
> >> This pre-fetching is strictly client only, where the client keeps the
> server busy while it is processing the previous batch, but filling up a 2nd
> buffer.
> >>
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Sandy Pratt <pr...@adobe.com>
> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> Sent: Wednesday, June 5, 2013 10:58 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Yong,
> >>
> >> As a thought experiment, imagine how it impacts the throughput of TCP to
> >> keep the window size at 1.  That means there's only one packet in flight
> >> at a time, and total throughput is a fraction of what it could be.
> >>
> >> That's effectively what happens with RPC.  The server sends a batch,
> then
> >> does nothing while it waits for the client to ask for more.  During that
> >> time, the pipe between them is empty.  Increasing the batch size can
> help
> >> a bit, in essence creating a really huge packet, but the problem
> remains.
> >> There will always be stalls in the pipe.
> >>
> >> What you want is for the window size to be large enough that the pipe is
> >> saturated.  A streaming API accomplishes that by stuffing data down the
> >> network pipe as quickly as possible.
> >>
> >> Sandy
> >>
> >> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
> >>
> >>> Can anyone explain why client + rpc + server will decrease the
> performance
> >>> of scanning? I mean the Regionserver and Tasktracker are the same node
> >>> when
> >>> you use MapReduce to scan the HBase table. So, in my understanding,
> there
> >>> will be no rpc cost.
> >>>
> >>> Thanks!
> >>>
> >>> Yong
> >>>
> >>>
> >>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com>
> wrote:
> >>>
> >>>> https://issues.apache.org/jira/browse/HBASE-8691
> >>>>
> >>>>
> >>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
> >>>>
> >>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in
> here
> >>>>> with an update in the meantime.
> >>>>>
> >>>>> I tried a number of different approaches to eliminate latency and
> >>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
> >>>>> streaming scan API to the region server, along with refactoring the
> >>>> scan
> >>>>> interface into an event-drive message receiver interface.  In so
> >>>> doing, I
> >>>>> was able to take scan speed on my cluster from 59,537 records/sec
> with
> >>>> the
> >>>>> classic scanner to 222,703 records per second with my new scan API.
> >>>>> Needless to say, I'm pleased ;)
> >>>>>
> >>>>> More details forthcoming when I get a chance.
> >>>>>
> >>>>> Thanks,
> >>>>> Sandy
> >>>>>
> >>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >>>>>
> >>>>>> Thanks for the update, Sandy.
> >>>>>>
> >>>>>> If you can open a JIRA and attach your producer / consumer scanner
> >>>> there,
> >>>>>> that would be great.
> >>>>>>
> >>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
> >>>> wrote:
> >>>>>>
> >>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
> >>>> queue to
> >>>>>>> keep the client fed with a full buffer as much as possible.  When
> >>>>>>> scanning
> >>>>>>> my table with scanner caching at 100 records, I see about a 24%
> >>>> uplift
> >>>>>>> in
> >>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
> >>>>>>> records/sec
> >>>>>>> with my P/C scanner).  However, when I set scanner caching to 5000,
> >>>>>>> it's
> >>>>>>> more of a wash compared to the standard ClientScanner: ~53k
> >>>> records/sec
> >>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> >>>>>>>
> >>>>>>> I'm not sure what to make of those results.  I think next I'll shut
> >>>>>>> down
> >>>>>>> HBase and read the HFiles directly, to see if there's a drop off in
> >>>>>>> performance between reading them directly vs. via the RegionServer.
> >>>>>>>
> >>>>>>> I still think that to really solve this there needs to be sliding
> >>>>>>> window
> >>>>>>> of records in flight between disk and RS, and between RS and
> client.
> >>>>>>> I'm
> >>>>>>> thinking there's probably a single batch of records in flight
> >>>> between
> >>>>>>> RS
> >>>>>>> and client at the moment.
> >>>>>>>
> >>>>>>> Sandy
> >>>>>>>
> >>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> I am considering scanning a snapshot instead of the table. I
> >>>> believe
> >>>>>>> this
> >>>>>>>> is what the ExportSnapshot class does. If I could use the scanning
> >>>>>>> code
> >>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
> >>>>>>> directly
> >>>>>>>> and bypass the regionservers. This could potentially give me a
> huge
> >>>>>>> boost
> >>>>>>>> in performance for full table scans. However, it doesn't really
> >>>>>>> address
> >>>>>>>> the poor scan performance against a table.
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >
>
>


Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
Looking at the tail of HBASE-8369, there are some comments that have yet to
be addressed.

I think the trunk patch should be finalized before backporting.

Cheers

On Mon, Jul 1, 2013 at 12:23 PM, Bryan Keller <br...@gmail.com> wrote:

> I'll attach my patch to HBASE-8369 tomorrow.
>
> On Jun 28, 2013, at 10:56 AM, lars hofhansl <la...@apache.org> wrote:
>
> > If we can make a clean patch with minimal impact to existing code I
> would be supportive of a backport to 0.94.
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > Cc:
> > Sent: Tuesday, June 25, 2013 1:56 AM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> > I tweaked Enis's snapshot input format and backported it to 0.94.6 and
> have snapshot scanning functional on my system. Performance is dramatically
> better, as expected i suppose. I'm seeing about 3.6x faster performance vs
> TableInputFormat. Also, HBase doesn't get bogged down during a scan as the
> regionserver is being bypassed. I'm very excited by this. There are some
> issues with file permissions and library dependencies but nothing that
> can't be worked out.
> >
> > On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> That's exactly the kind of pre-fetching I was investigating a bit ago
> (made a patch, but ran out of time).
> >> This pre-fetching is strictly client only, where the client keeps the
> server busy while it is processing the previous batch, but filling up a 2nd
> buffer.
> >>
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Sandy Pratt <pr...@adobe.com>
> >> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> Sent: Wednesday, June 5, 2013 10:58 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Yong,
> >>
> >> As a thought experiment, imagine how it impacts the throughput of TCP to
> >> keep the window size at 1.  That means there's only one packet in flight
> >> at a time, and total throughput is a fraction of what it could be.
> >>
> >> That's effectively what happens with RPC.  The server sends a batch,
> then
> >> does nothing while it waits for the client to ask for more.  During that
> >> time, the pipe between them is empty.  Increasing the batch size can
> help
> >> a bit, in essence creating a really huge packet, but the problem
> remains.
> >> There will always be stalls in the pipe.
> >>
> >> What you want is for the window size to be large enough that the pipe is
> >> saturated.  A streaming API accomplishes that by stuffing data down the
> >> network pipe as quickly as possible.
> >>
> >> Sandy
> >>
> >> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
> >>
> >>> Can anyone explain why client + rpc + server will decrease the
> performance
> >>> of scanning? I mean the Regionserver and Tasktracker are the same node
> >>> when
> >>> you use MapReduce to scan the HBase table. So, in my understanding,
> there
> >>> will be no rpc cost.
> >>>
> >>> Thanks!
> >>>
> >>> Yong
> >>>
> >>>
> >>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com>
> wrote:
> >>>
> >>>> https://issues.apache.org/jira/browse/HBASE-8691
> >>>>
> >>>>
> >>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
> >>>>
> >>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in
> here
> >>>>> with an update in the meantime.
> >>>>>
> >>>>> I tried a number of different approaches to eliminate latency and
> >>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
> >>>>> streaming scan API to the region server, along with refactoring the
> >>>> scan
> >>>>> interface into an event-drive message receiver interface.  In so
> >>>> doing, I
> >>>>> was able to take scan speed on my cluster from 59,537 records/sec
> with
> >>>> the
> >>>>> classic scanner to 222,703 records per second with my new scan API.
> >>>>> Needless to say, I'm pleased ;)
> >>>>>
> >>>>> More details forthcoming when I get a chance.
> >>>>>
> >>>>> Thanks,
> >>>>> Sandy
> >>>>>
> >>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >>>>>
> >>>>>> Thanks for the update, Sandy.
> >>>>>>
> >>>>>> If you can open a JIRA and attach your producer / consumer scanner
> >>>> there,
> >>>>>> that would be great.
> >>>>>>
> >>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
> >>>> wrote:
> >>>>>>
> >>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
> >>>> queue to
> >>>>>>> keep the client fed with a full buffer as much as possible.  When
> >>>>>>> scanning
> >>>>>>> my table with scanner caching at 100 records, I see about a 24%
> >>>> uplift
> >>>>>>> in
> >>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
> >>>>>>> records/sec
> >>>>>>> with my P/C scanner).  However, when I set scanner caching to 5000,
> >>>>>>> it's
> >>>>>>> more of a wash compared to the standard ClientScanner: ~53k
> >>>> records/sec
> >>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> >>>>>>>
> >>>>>>> I'm not sure what to make of those results.  I think next I'll shut
> >>>>>>> down
> >>>>>>> HBase and read the HFiles directly, to see if there's a drop off in
> >>>>>>> performance between reading them directly vs. via the RegionServer.
> >>>>>>>
> >>>>>>> I still think that to really solve this there needs to be sliding
> >>>>>>> window
> >>>>>>> of records in flight between disk and RS, and between RS and
> client.
> >>>>>>> I'm
> >>>>>>> thinking there's probably a single batch of records in flight
> >>>> between
> >>>>>>> RS
> >>>>>>> and client at the moment.
> >>>>>>>
> >>>>>>> Sandy
> >>>>>>>
> >>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> I am considering scanning a snapshot instead of the table. I
> >>>> believe
> >>>>>>> this
> >>>>>>>> is what the ExportSnapshot class does. If I could use the scanning
> >>>>>>> code
> >>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
> >>>>>>> directly
> >>>>>>>> and bypass the regionservers. This could potentially give me a
> huge
> >>>>>>> boost
> >>>>>>>> in performance for full table scans. However, it doesn't really
> >>>>>>> address
> >>>>>>>> the poor scan performance against a table.
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >
>
>

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I'll attach my patch to HBASE-8369 tomorrow.

On Jun 28, 2013, at 10:56 AM, lars hofhansl <la...@apache.org> wrote:

> If we can make a clean patch with minimal impact to existing code I would be supportive of a backport to 0.94.
> 
> -- Lars
> 
> 
> 
> ----- Original Message -----
> From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> Cc: 
> Sent: Tuesday, June 25, 2013 1:56 AM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> I tweaked Enis's snapshot input format and backported it to 0.94.6 and have snapshot scanning functional on my system. Performance is dramatically better, as expected i suppose. I'm seeing about 3.6x faster performance vs TableInputFormat. Also, HBase doesn't get bogged down during a scan as the regionserver is being bypassed. I'm very excited by this. There are some issues with file permissions and library dependencies but nothing that can't be worked out.
> 
> On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> That's exactly the kind of pre-fetching I was investigating a bit ago (made a patch, but ran out of time).
>> This pre-fetching is strictly client only, where the client keeps the server busy while it is processing the previous batch, but filling up a 2nd buffer.
>> 
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Sandy Pratt <pr...@adobe.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Wednesday, June 5, 2013 10:58 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> Yong,
>> 
>> As a thought experiment, imagine how it impacts the throughput of TCP to
>> keep the window size at 1.  That means there's only one packet in flight
>> at a time, and total throughput is a fraction of what it could be.
>> 
>> That's effectively what happens with RPC.  The server sends a batch, then
>> does nothing while it waits for the client to ask for more.  During that
>> time, the pipe between them is empty.  Increasing the batch size can help
>> a bit, in essence creating a really huge packet, but the problem remains.
>> There will always be stalls in the pipe.
>> 
>> What you want is for the window size to be large enough that the pipe is
>> saturated.  A streaming API accomplishes that by stuffing data down the
>> network pipe as quickly as possible.
>> 
>> Sandy
>> 
>> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
>> 
>>> Can anyone explain why client + rpc + server will decrease the performance
>>> of scanning? I mean the Regionserver and Tasktracker are the same node
>>> when
>>> you use MapReduce to scan the HBase table. So, in my understanding, there
>>> will be no rpc cost.
>>> 
>>> Thanks!
>>> 
>>> Yong
>>> 
>>> 
>>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
>>> 
>>>> https://issues.apache.org/jira/browse/HBASE-8691
>>>> 
>>>> 
>>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>>>> 
>>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>>>>> with an update in the meantime.
>>>>> 
>>>>> I tried a number of different approaches to eliminate latency and
>>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
>>>>> streaming scan API to the region server, along with refactoring the
>>>> scan
>>>>> interface into an event-drive message receiver interface.  In so
>>>> doing, I
>>>>> was able to take scan speed on my cluster from 59,537 records/sec with
>>>> the
>>>>> classic scanner to 222,703 records per second with my new scan API.
>>>>> Needless to say, I'm pleased ;)
>>>>> 
>>>>> More details forthcoming when I get a chance.
>>>>> 
>>>>> Thanks,
>>>>> Sandy
>>>>> 
>>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>>> 
>>>>>> Thanks for the update, Sandy.
>>>>>> 
>>>>>> If you can open a JIRA and attach your producer / consumer scanner
>>>> there,
>>>>>> that would be great.
>>>>>> 
>>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>>>> wrote:
>>>>>> 
>>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
>>>> queue to
>>>>>>> keep the client fed with a full buffer as much as possible.  When
>>>>>>> scanning
>>>>>>> my table with scanner caching at 100 records, I see about a 24%
>>>> uplift
>>>>>>> in
>>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
>>>>>>> records/sec
>>>>>>> with my P/C scanner).  However, when I set scanner caching to 5000,
>>>>>>> it's
>>>>>>> more of a wash compared to the standard ClientScanner: ~53k
>>>> records/sec
>>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>>>>>> 
>>>>>>> I'm not sure what to make of those results.  I think next I'll shut
>>>>>>> down
>>>>>>> HBase and read the HFiles directly, to see if there's a drop off in
>>>>>>> performance between reading them directly vs. via the RegionServer.
>>>>>>> 
>>>>>>> I still think that to really solve this there needs to be sliding
>>>>>>> window
>>>>>>> of records in flight between disk and RS, and between RS and client.
>>>>>>> I'm
>>>>>>> thinking there's probably a single batch of records in flight
>>>> between
>>>>>>> RS
>>>>>>> and client at the moment.
>>>>>>> 
>>>>>>> Sandy
>>>>>>> 
>>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I am considering scanning a snapshot instead of the table. I
>>>> believe
>>>>>>> this
>>>>>>>> is what the ExportSnapshot class does. If I could use the scanning
>>>>>>> code
>>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
>>>>>>> directly
>>>>>>>> and bypass the regionservers. This could potentially give me a huge
>>>>>>> boost
>>>>>>>> in performance for full table scans. However, it doesn't really
>>>>>>> address
>>>>>>>> the poor scan performance against a table.
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
> 


Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
If we can make a clean patch with minimal impact on existing code, I would be supportive of a backport to 0.94.

-- Lars



----- Original Message -----
From: Bryan Keller <br...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
Cc: 
Sent: Tuesday, June 25, 2013 1:56 AM
Subject: Re: Poor HBase map-reduce scan performance

I tweaked Enis's snapshot input format and backported it to 0.94.6 and have snapshot scanning functional on my system. Performance is dramatically better, as expected i suppose. I'm seeing about 3.6x faster performance vs TableInputFormat. Also, HBase doesn't get bogged down during a scan as the regionserver is being bypassed. I'm very excited by this. There are some issues with file permissions and library dependencies but nothing that can't be worked out.

On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:

> That's exactly the kind of pre-fetching I was investigating a bit ago (made a patch, but ran out of time).
> This pre-fetching is strictly client only, where the client keeps the server busy while it is processing the previous batch, but filling up a 2nd buffer.
> 
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Sandy Pratt <pr...@adobe.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Wednesday, June 5, 2013 10:58 AM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> Yong,
> 
> As a thought experiment, imagine how it impacts the throughput of TCP to
> keep the window size at 1.  That means there's only one packet in flight
> at a time, and total throughput is a fraction of what it could be.
> 
> That's effectively what happens with RPC.  The server sends a batch, then
> does nothing while it waits for the client to ask for more.  During that
> time, the pipe between them is empty.  Increasing the batch size can help
> a bit, in essence creating a really huge packet, but the problem remains.
> There will always be stalls in the pipe.
> 
> What you want is for the window size to be large enough that the pipe is
> saturated.  A streaming API accomplishes that by stuffing data down the
> network pipe as quickly as possible.
> 
> Sandy
> 
> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
> 
>> Can anyone explain why client + rpc + server will decrease the performance
>> of scanning? I mean the Regionserver and Tasktracker are the same node
>> when
>> you use MapReduce to scan the HBase table. So, in my understanding, there
>> will be no rpc cost.
>> 
>> Thanks!
>> 
>> Yong
>> 
>> 
>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
>> 
>>> https://issues.apache.org/jira/browse/HBASE-8691
>>> 
>>> 
>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>>> 
>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>>>> with an update in the meantime.
>>>> 
>>>> I tried a number of different approaches to eliminate latency and
>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
>>>> streaming scan API to the region server, along with refactoring the
>>> scan
>>>> interface into an event-drive message receiver interface.  In so
>>> doing, I
>>>> was able to take scan speed on my cluster from 59,537 records/sec with
>>> the
>>>> classic scanner to 222,703 records per second with my new scan API.
>>>> Needless to say, I'm pleased ;)
>>>> 
>>>> More details forthcoming when I get a chance.
>>>> 
>>>> Thanks,
>>>> Sandy
>>>> 
>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>> 
>>>>> Thanks for the update, Sandy.
>>>>> 
>>>>> If you can open a JIRA and attach your producer / consumer scanner
>>> there,
>>>>> that would be great.
>>>>> 
>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>>> wrote:
>>>>> 
>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
>>> queue to
>>>>>> keep the client fed with a full buffer as much as possible.  When
>>>>>> scanning
>>>>>> my table with scanner caching at 100 records, I see about a 24%
>>> uplift
>>>>>> in
>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
>>>>>> records/sec
>>>>>> with my P/C scanner).  However, when I set scanner caching to 5000,
>>>>>> it's
>>>>>> more of a wash compared to the standard ClientScanner: ~53k
>>> records/sec
>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>>>>> 
>>>>>> I'm not sure what to make of those results.  I think next I'll shut
>>>>>> down
>>>>>> HBase and read the HFiles directly, to see if there's a drop off in
>>>>>> performance between reading them directly vs. via the RegionServer.
>>>>>> 
>>>>>> I still think that to really solve this there needs to be sliding
>>>>>> window
>>>>>> of records in flight between disk and RS, and between RS and client.
>>>>>> I'm
>>>>>> thinking there's probably a single batch of records in flight
>>> between
>>>>>> RS
>>>>>> and client at the moment.
>>>>>> 
>>>>>> Sandy
>>>>>> 
>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>>>>> 
>>>>>>> I am considering scanning a snapshot instead of the table. I
>>> believe
>>>>>> this
>>>>>>> is what the ExportSnapshot class does. If I could use the scanning
>>>>>> code
>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
>>>>>> directly
>>>>>>> and bypass the regionservers. This could potentially give me a huge
>>>>>> boost
>>>>>>> in performance for full table scans. However, it doesn't really
>>>>>> address
>>>>>>> the poor scan performance against a table.
>>>>>> 
>>>>>> 
>>>> 
>>> 


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I tweaked Enis's snapshot input format and backported it to 0.94.6, and I now have snapshot scanning functional on my system. Performance is dramatically better, as expected, I suppose. I'm seeing about 3.6x faster performance vs TableInputFormat. Also, HBase doesn't get bogged down during a scan, as the regionserver is being bypassed. I'm very excited by this. There are some issues with file permissions and library dependencies, but nothing that can't be worked out.
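
(A rough sketch of how a job gets wired up against the snapshot input format,
assuming the TableSnapshotInputFormat API as it later shipped with HBASE-8369;
the snapshot name and restore directory below are placeholders.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

Job job = new Job(HBaseConfiguration.create(), "snapshot-scan");
Scan scan = new Scan();   // scan over the snapshot; no regionserver is involved

// The snapshot's HFiles are linked into the restore directory, and the input
// splits read them directly from HDFS, bypassing the regionservers.
TableMapReduceUtil.initTableSnapshotMapperJob(
    "mytable-snapshot",               // snapshot name (placeholder)
    scan,
    IdentityTableMapper.class,        // any TableMapper
    ImmutableBytesWritable.class,
    Result.class,
    job,
    true,                             // ship HBase dependency jars with the job
    new Path("/tmp/snapshot-restore-dir"));
job.setNumReduceTasks(0);
job.setOutputFormatClass(NullOutputFormat.class);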

On Jun 5, 2013, at 6:03 PM, lars hofhansl <la...@apache.org> wrote:

> That's exactly the kind of pre-fetching I was investigating a bit ago (made a patch, but ran out of time).
> This pre-fetching is strictly client only, where the client keeps the server busy while it is processing the previous batch, but filling up a 2nd buffer.
> 
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Sandy Pratt <pr...@adobe.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Wednesday, June 5, 2013 10:58 AM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> Yong,
> 
> As a thought experiment, imagine how it impacts the throughput of TCP to
> keep the window size at 1.  That means there's only one packet in flight
> at a time, and total throughput is a fraction of what it could be.
> 
> That's effectively what happens with RPC.  The server sends a batch, then
> does nothing while it waits for the client to ask for more.  During that
> time, the pipe between them is empty.  Increasing the batch size can help
> a bit, in essence creating a really huge packet, but the problem remains.
> There will always be stalls in the pipe.
> 
> What you want is for the window size to be large enough that the pipe is
> saturated.  A streaming API accomplishes that by stuffing data down the
> network pipe as quickly as possible.
> 
> Sandy
> 
> On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:
> 
>> Can anyone explain why client + rpc + server will decrease the performance
>> of scanning? I mean the Regionserver and Tasktracker are the same node
>> when
>> you use MapReduce to scan the HBase table. So, in my understanding, there
>> will be no rpc cost.
>> 
>> Thanks!
>> 
>> Yong
>> 
>> 
>> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
>> 
>>> https://issues.apache.org/jira/browse/HBASE-8691
>>> 
>>> 
>>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>>> 
>>>> Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>>>> with an update in the meantime.
>>>> 
>>>> I tried a number of different approaches to eliminate latency and
>>>> "bubbles" in the scan pipeline, and eventually arrived at adding a
>>>> streaming scan API to the region server, along with refactoring the
>>> scan
>>>> interface into an event-drive message receiver interface.  In so
>>> doing, I
>>>> was able to take scan speed on my cluster from 59,537 records/sec with
>>> the
>>>> classic scanner to 222,703 records per second with my new scan API.
>>>> Needless to say, I'm pleased ;)
>>>> 
>>>> More details forthcoming when I get a chance.
>>>> 
>>>> Thanks,
>>>> Sandy
>>>> 
>>>> On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>> 
>>>>> Thanks for the update, Sandy.
>>>>> 
>>>>> If you can open a JIRA and attach your producer / consumer scanner
>>> there,
>>>>> that would be great.
>>>>> 
>>>>> On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>>> wrote:
>>>>> 
>>>>>> I wrote myself a Scanner wrapper that uses a producer/consumer
>>> queue to
>>>>>> keep the client fed with a full buffer as much as possible.  When
>>>>>> scanning
>>>>>> my table with scanner caching at 100 records, I see about a 24%
>>> uplift
>>>>>> in
>>>>>> performance (~35k records/sec with the ClientScanner and ~44k
>>>>>> records/sec
>>>>>> with my P/C scanner).  However, when I set scanner caching to 5000,
>>>>>> it's
>>>>>> more of a wash compared to the standard ClientScanner: ~53k
>>> records/sec
>>>>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>>>>> 
>>>>>> I'm not sure what to make of those results.  I think next I'll shut
>>>>>> down
>>>>>> HBase and read the HFiles directly, to see if there's a drop off in
>>>>>> performance between reading them directly vs. via the RegionServer.
>>>>>> 
>>>>>> I still think that to really solve this there needs to be sliding
>>>>>> window
>>>>>> of records in flight between disk and RS, and between RS and client.
>>>>>> I'm
>>>>>> thinking there's probably a single batch of records in flight
>>> between
>>>>>> RS
>>>>>> and client at the moment.
>>>>>> 
>>>>>> Sandy
>>>>>> 
>>>>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>>>>> 
>>>>>>> I am considering scanning a snapshot instead of the table. I
>>> believe
>>>>>> this
>>>>>>> is what the ExportSnapshot class does. If I could use the scanning
>>>>>> code
>>>>>>> from ExportSnapshot then I will be able to scan the HDFS files
>>>>>> directly
>>>>>>> and bypass the regionservers. This could potentially give me a huge
>>>>>> boost
>>>>>>> in performance for full table scans. However, it doesn't really
>>>>>> address
>>>>>>> the poor scan performance against a table.
>>>>>> 
>>>>>> 
>>>> 
>>> 


Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
That's exactly the kind of pre-fetching I was investigating a bit ago (made a patch, but ran out of time).
This pre-fetching is strictly client only, where the client keeps the server busy while it is processing the previous batch, but filling up a 2nd buffer.


-- Lars



________________________________
 From: Sandy Pratt <pr...@adobe.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Wednesday, June 5, 2013 10:58 AM
Subject: Re: Poor HBase map-reduce scan performance
 

Yong,

As a thought experiment, imagine how it impacts the throughput of TCP to
keep the window size at 1.  That means there's only one packet in flight
at a time, and total throughput is a fraction of what it could be.

That's effectively what happens with RPC.  The server sends a batch, then
does nothing while it waits for the client to ask for more.  During that
time, the pipe between them is empty.  Increasing the batch size can help
a bit, in essence creating a really huge packet, but the problem remains.
There will always be stalls in the pipe.

What you want is for the window size to be large enough that the pipe is
saturated.  A streaming API accomplishes that by stuffing data down the
network pipe as quickly as possible.

Sandy

On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:

>Can anyone explain why client + rpc + server will decrease the performance
>of scanning? I mean the Regionserver and Tasktracker are the same node
>when
>you use MapReduce to scan the HBase table. So, in my understanding, there
>will be no rpc cost.
>
>Thanks!
>
>Yong
>
>
>On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
>
>> https://issues.apache.org/jira/browse/HBASE-8691
>>
>>
>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>>
>> >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>> >with an update in the meantime.
>> >
>> >I tried a number of different approaches to eliminate latency and
>> >"bubbles" in the scan pipeline, and eventually arrived at adding a
>> >streaming scan API to the region server, along with refactoring the
>>scan
>> >interface into an event-drive message receiver interface.  In so
>>doing, I
>> >was able to take scan speed on my cluster from 59,537 records/sec with
>>the
>> >classic scanner to 222,703 records per second with my new scan API.
>> >Needless to say, I'm pleased ;)
>> >
>> >More details forthcoming when I get a chance.
>> >
>> >Thanks,
>> >Sandy
>> >
>> >On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> >
>> >>Thanks for the update, Sandy.
>> >>
>> >>If you can open a JIRA and attach your producer / consumer scanner
>>there,
>> >>that would be great.
>> >>
>> >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>>wrote:
>> >>
>> >>> I wrote myself a Scanner wrapper that uses a producer/consumer
>>queue to
>> >>> keep the client fed with a full buffer as much as possible.  When
>> >>>scanning
>> >>> my table with scanner caching at 100 records, I see about a 24%
>>uplift
>> >>>in
>> >>> performance (~35k records/sec with the ClientScanner and ~44k
>> >>>records/sec
>> >>> with my P/C scanner).  However, when I set scanner caching to 5000,
>> >>>it's
>> >>> more of a wash compared to the standard ClientScanner: ~53k
>>records/sec
>> >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>> >>>
>> >>> I'm not sure what to make of those results.  I think next I'll shut
>> >>>down
>> >>> HBase and read the HFiles directly, to see if there's a drop off in
>> >>> performance between reading them directly vs. via the RegionServer.
>> >>>
>> >>> I still think that to really solve this there needs to be sliding
>> >>>window
>> >>> of records in flight between disk and RS, and between RS and client.
>> >>>I'm
>> >>> thinking there's probably a single batch of records in flight
>>between
>> >>>RS
>> >>> and client at the moment.
>> >>>
>> >>> Sandy
>> >>>
>> >>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>> >>>
>> >>> >I am considering scanning a snapshot instead of the table. I
>>believe
>> >>>this
>> >>> >is what the ExportSnapshot class does. If I could use the scanning
>> >>>code
>> >>> >from ExportSnapshot then I will be able to scan the HDFS files
>> >>>directly
>> >>> >and bypass the regionservers. This could potentially give me a huge
>> >>>boost
>> >>> >in performance for full table scans. However, it doesn't really
>> >>>address
>> >>> >the poor scan performance against a table.
>> >>>
>> >>>
>> >
>>
>>

Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
That's my understanding of how the current scan API works, yes.  The
client calls next() to fetch a batch and blocks while it waits for the
response from the server.  After the server responds to the next() call,
the server does nothing for that scanner until the following next() call.
That makes for some significant bubbles in the pipeline, even with larger
batch sizes for next().

Anyone please correct me if I'm wrong.
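
To make that concrete, here is a minimal sketch of the standard client-side scan loop (stock 0.94-era client API; the table name is a placeholder). Each time the client's cached batch runs out, the iteration blocks on a synchronous next() RPC, and while the application code runs, the region server does no work for that scanner.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BlockingScanLoop {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Scan scan = new Scan();
    scan.setCaching(1000);       // rows fetched per next() RPC
    scan.setCacheBlocks(false);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) { // blocks on an RPC whenever the cached batch is empty
        // process r here; the region server is idle for this scanner in the meantime
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}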


On 6/5/13 11:14 AM, "yonghu" <yo...@gmail.com> wrote:

>Dear Sandy,
>
>Thanks for your explanation.
>
>However, what I don't get is your term "client", is this "client" means
>MapReduce jobs? If I understand you right, this means Map function will
>process the tuples and during this processing time, the regionserver did
>nothing?
>
>regards!
>
>Yong
>
>
>On Wed, Jun 5, 2013 at 6:12 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> bq. the Regionserver and Tasktracker are the same node when you use
>> MapReduce to scan the HBase table.
>>
>> The scan performed by the Tasktracker on that node would very likely
>>access
>> data hosted by region server on other node(s). So there would be RPC
>> involved.
>>
>> There is some discussion on providing shadow reads - writes to specific
>> region are solely served by one region server but the reads can be
>>served
>> by more than one region server. Of course consistency is one aspect that
>> must be tackled.
>>
>> Cheers
>>
>> On Wed, Jun 5, 2013 at 7:55 AM, yonghu <yo...@gmail.com> wrote:
>>
>> > Can anyone explain why client + rpc + server will decrease the
>> performance
>> > of scanning? I mean the Regionserver and Tasktracker are the same node
>> when
>> > you use MapReduce to scan the HBase table. So, in my understanding,
>>there
>> > will be no rpc cost.
>> >
>> > Thanks!
>> >
>> > Yong
>> >
>> >
>> > On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com>
>>wrote:
>> >
>> > > https://issues.apache.org/jira/browse/HBASE-8691
>> > >
>> > >
>> > > On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>> > >
>> > > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in
>> here
>> > > >with an update in the meantime.
>> > > >
>> > > >I tried a number of different approaches to eliminate latency and
>> > > >"bubbles" in the scan pipeline, and eventually arrived at adding a
>> > > >streaming scan API to the region server, along with refactoring the
>> scan
>> > > >interface into an event-drive message receiver interface.  In so
>> doing,
>> > I
>> > > >was able to take scan speed on my cluster from 59,537 records/sec
>>with
>> > the
>> > > >classic scanner to 222,703 records per second with my new scan API.
>> > > >Needless to say, I'm pleased ;)
>> > > >
>> > > >More details forthcoming when I get a chance.
>> > > >
>> > > >Thanks,
>> > > >Sandy
>> > > >
>> > > >On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> > > >
>> > > >>Thanks for the update, Sandy.
>> > > >>
>> > > >>If you can open a JIRA and attach your producer / consumer scanner
>> > there,
>> > > >>that would be great.
>> > > >>
>> > > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>> > wrote:
>> > > >>
>> > > >>> I wrote myself a Scanner wrapper that uses a producer/consumer
>> queue
>> > to
>> > > >>> keep the client fed with a full buffer as much as possible.
>>When
>> > > >>>scanning
>> > > >>> my table with scanner caching at 100 records, I see about a 24%
>> > uplift
>> > > >>>in
>> > > >>> performance (~35k records/sec with the ClientScanner and ~44k
>> > > >>>records/sec
>> > > >>> with my P/C scanner).  However, when I set scanner caching to
>>5000,
>> > > >>>it's
>> > > >>> more of a wash compared to the standard ClientScanner: ~53k
>> > records/sec
>> > > >>> with the ClientScanner and ~60k records/sec with the P/C
>>scanner.
>> > > >>>
>> > > >>> I'm not sure what to make of those results.  I think next I'll
>>shut
>> > > >>>down
>> > > >>> HBase and read the HFiles directly, to see if there's a drop
>>off in
>> > > >>> performance between reading them directly vs. via the
>>RegionServer.
>> > > >>>
>> > > >>> I still think that to really solve this there needs to be
>>sliding
>> > > >>>window
>> > > >>> of records in flight between disk and RS, and between RS and
>> client.
>> > > >>>I'm
>> > > >>> thinking there's probably a single batch of records in flight
>> between
>> > > >>>RS
>> > > >>> and client at the moment.
>> > > >>>
>> > > >>> Sandy
>> > > >>>
>> > > >>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>> > > >>>
>> > > >>> >I am considering scanning a snapshot instead of the table. I
>> believe
>> > > >>>this
>> > > >>> >is what the ExportSnapshot class does. If I could use the
>>scanning
>> > > >>>code
>> > > >>> >from ExportSnapshot then I will be able to scan the HDFS files
>> > > >>>directly
>> > > >>> >and bypass the regionservers. This could potentially give me a
>> huge
>> > > >>>boost
>> > > >>> >in performance for full table scans. However, it doesn't really
>> > > >>>address
>> > > >>> >the poor scan performance against a table.
>> > > >>>
>> > > >>>
>> > > >
>> > >
>> > >
>> >
>>


Re: Poor HBase map-reduce scan performance

Posted by yonghu <yo...@gmail.com>.
Dear Sandy,

Thanks for your explanation.

However, what I don't get is your term "client": does "client" mean the
MapReduce job? If I understand you correctly, the map function processes
the tuples, and during that processing time the regionserver does
nothing?

regards!

Yong


On Wed, Jun 5, 2013 at 6:12 PM, Ted Yu <yu...@gmail.com> wrote:

> bq. the Regionserver and Tasktracker are the same node when you use
> MapReduce to scan the HBase table.
>
> The scan performed by the Tasktracker on that node would very likely access
> data hosted by region server on other node(s). So there would be RPC
> involved.
>
> There is some discussion on providing shadow reads - writes to specific
> region are solely served by one region server but the reads can be served
> by more than one region server. Of course consistency is one aspect that
> must be tackled.
>
> Cheers
>
> On Wed, Jun 5, 2013 at 7:55 AM, yonghu <yo...@gmail.com> wrote:
>
> > Can anyone explain why client + rpc + server will decrease the
> performance
> > of scanning? I mean the Regionserver and Tasktracker are the same node
> when
> > you use MapReduce to scan the HBase table. So, in my understanding, there
> > will be no rpc cost.
> >
> > Thanks!
> >
> > Yong
> >
> >
> > On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
> >
> > > https://issues.apache.org/jira/browse/HBASE-8691
> > >
> > >
> > > On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
> > >
> > > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in
> here
> > > >with an update in the meantime.
> > > >
> > > >I tried a number of different approaches to eliminate latency and
> > > >"bubbles" in the scan pipeline, and eventually arrived at adding a
> > > >streaming scan API to the region server, along with refactoring the
> scan
> > > >interface into an event-drive message receiver interface.  In so
> doing,
> > I
> > > >was able to take scan speed on my cluster from 59,537 records/sec with
> > the
> > > >classic scanner to 222,703 records per second with my new scan API.
> > > >Needless to say, I'm pleased ;)
> > > >
> > > >More details forthcoming when I get a chance.
> > > >
> > > >Thanks,
> > > >Sandy
> > > >
> > > >On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
> > > >
> > > >>Thanks for the update, Sandy.
> > > >>
> > > >>If you can open a JIRA and attach your producer / consumer scanner
> > there,
> > > >>that would be great.
> > > >>
> > > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
> > wrote:
> > > >>
> > > >>> I wrote myself a Scanner wrapper that uses a producer/consumer
> queue
> > to
> > > >>> keep the client fed with a full buffer as much as possible.  When
> > > >>>scanning
> > > >>> my table with scanner caching at 100 records, I see about a 24%
> > uplift
> > > >>>in
> > > >>> performance (~35k records/sec with the ClientScanner and ~44k
> > > >>>records/sec
> > > >>> with my P/C scanner).  However, when I set scanner caching to 5000,
> > > >>>it's
> > > >>> more of a wash compared to the standard ClientScanner: ~53k
> > records/sec
> > > >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > > >>>
> > > >>> I'm not sure what to make of those results.  I think next I'll shut
> > > >>>down
> > > >>> HBase and read the HFiles directly, to see if there's a drop off in
> > > >>> performance between reading them directly vs. via the RegionServer.
> > > >>>
> > > >>> I still think that to really solve this there needs to be sliding
> > > >>>window
> > > >>> of records in flight between disk and RS, and between RS and
> client.
> > > >>>I'm
> > > >>> thinking there's probably a single batch of records in flight
> between
> > > >>>RS
> > > >>> and client at the moment.
> > > >>>
> > > >>> Sandy
> > > >>>
> > > >>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
> > > >>>
> > > >>> >I am considering scanning a snapshot instead of the table. I
> believe
> > > >>>this
> > > >>> >is what the ExportSnapshot class does. If I could use the scanning
> > > >>>code
> > > >>> >from ExportSnapshot then I will be able to scan the HDFS files
> > > >>>directly
> > > >>> >and bypass the regionservers. This could potentially give me a
> huge
> > > >>>boost
> > > >>> >in performance for full table scans. However, it doesn't really
> > > >>>address
> > > >>> >the poor scan performance against a table.
> > > >>>
> > > >>>
> > > >
> > >
> > >
> >
>

Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
bq. the Regionserver and Tasktracker are the same node when you use
MapReduce to scan the HBase table.

The scan performed by the TaskTracker on that node would very likely access
data hosted by region servers on other node(s), so there would be RPC
involved.

There is some discussion on providing shadow reads: writes to a specific
region are served solely by one region server, but reads can be served
by more than one region server. Of course, consistency is one aspect that
must be tackled.

Cheers

On Wed, Jun 5, 2013 at 7:55 AM, yonghu <yo...@gmail.com> wrote:

> Can anyone explain why client + rpc + server will decrease the performance
> of scanning? I mean the Regionserver and Tasktracker are the same node when
> you use MapReduce to scan the HBase table. So, in my understanding, there
> will be no rpc cost.
>
> Thanks!
>
> Yong
>
>
> On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
>
> > https://issues.apache.org/jira/browse/HBASE-8691
> >
> >
> > On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
> >
> > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> > >with an update in the meantime.
> > >
> > >I tried a number of different approaches to eliminate latency and
> > >"bubbles" in the scan pipeline, and eventually arrived at adding a
> > >streaming scan API to the region server, along with refactoring the scan
> > >interface into an event-drive message receiver interface.  In so doing,
> I
> > >was able to take scan speed on my cluster from 59,537 records/sec with
> the
> > >classic scanner to 222,703 records per second with my new scan API.
> > >Needless to say, I'm pleased ;)
> > >
> > >More details forthcoming when I get a chance.
> > >
> > >Thanks,
> > >Sandy
> > >
> > >On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
> > >
> > >>Thanks for the update, Sandy.
> > >>
> > >>If you can open a JIRA and attach your producer / consumer scanner
> there,
> > >>that would be great.
> > >>
> > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
> wrote:
> > >>
> > >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue
> to
> > >>> keep the client fed with a full buffer as much as possible.  When
> > >>>scanning
> > >>> my table with scanner caching at 100 records, I see about a 24%
> uplift
> > >>>in
> > >>> performance (~35k records/sec with the ClientScanner and ~44k
> > >>>records/sec
> > >>> with my P/C scanner).  However, when I set scanner caching to 5000,
> > >>>it's
> > >>> more of a wash compared to the standard ClientScanner: ~53k
> records/sec
> > >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > >>>
> > >>> I'm not sure what to make of those results.  I think next I'll shut
> > >>>down
> > >>> HBase and read the HFiles directly, to see if there's a drop off in
> > >>> performance between reading them directly vs. via the RegionServer.
> > >>>
> > >>> I still think that to really solve this there needs to be sliding
> > >>>window
> > >>> of records in flight between disk and RS, and between RS and client.
> > >>>I'm
> > >>> thinking there's probably a single batch of records in flight between
> > >>>RS
> > >>> and client at the moment.
> > >>>
> > >>> Sandy
> > >>>
> > >>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
> > >>>
> > >>> >I am considering scanning a snapshot instead of the table. I believe
> > >>>this
> > >>> >is what the ExportSnapshot class does. If I could use the scanning
> > >>>code
> > >>> >from ExportSnapshot then I will be able to scan the HDFS files
> > >>>directly
> > >>> >and bypass the regionservers. This could potentially give me a huge
> > >>>boost
> > >>> >in performance for full table scans. However, it doesn't really
> > >>>address
> > >>> >the poor scan performance against a table.
> > >>>
> > >>>
> > >
> >
> >
>

Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
Yong,

As a thought experiment, imagine how it impacts the throughput of TCP to
keep the window size at 1.  That means there's only one packet in flight
at a time, and total throughput is a fraction of what it could be.

That's effectively what happens with RPC.  The server sends a batch, then
does nothing while it waits for the client to ask for more.  During that
time, the pipe between them is empty.  Increasing the batch size can help
a bit, in essence creating a really huge packet, but the problem remains.
There will always be stalls in the pipe.

What you want is for the window size to be large enough that the pipe is
saturated.  A streaming API accomplishes that by stuffing data down the
network pipe as quickly as possible.

Sandy
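
A quick sketch of the two knobs that make that "packet" bigger: the per-scan caching setting and the client-wide default (the 5000 below is illustrative only). As noted above, this only amortizes the per-RPC overhead; it does not remove the stalls between batches.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;

public class CachingKnobs {
  // Per-scan: rows returned by each next() RPC for this scan only.
  static Scan biggerBatches() {
    Scan scan = new Scan();
    scan.setCaching(5000);
    return scan;
  }

  // Client-wide default for every scan built from this configuration.
  static Configuration clientWideDefault() {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.scanner.caching", 5000);
    return conf;
  }
}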

On 6/5/13 7:55 AM, "yonghu" <yo...@gmail.com> wrote:

>Can anyone explain why client + rpc + server will decrease the performance
>of scanning? I mean the Regionserver and Tasktracker are the same node
>when
>you use MapReduce to scan the HBase table. So, in my understanding, there
>will be no rpc cost.
>
>Thanks!
>
>Yong
>
>
>On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:
>
>> https://issues.apache.org/jira/browse/HBASE-8691
>>
>>
>> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>>
>> >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>> >with an update in the meantime.
>> >
>> >I tried a number of different approaches to eliminate latency and
>> >"bubbles" in the scan pipeline, and eventually arrived at adding a
>> >streaming scan API to the region server, along with refactoring the
>>scan
>> >interface into an event-drive message receiver interface.  In so
>>doing, I
>> >was able to take scan speed on my cluster from 59,537 records/sec with
>>the
>> >classic scanner to 222,703 records per second with my new scan API.
>> >Needless to say, I'm pleased ;)
>> >
>> >More details forthcoming when I get a chance.
>> >
>> >Thanks,
>> >Sandy
>> >
>> >On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> >
>> >>Thanks for the update, Sandy.
>> >>
>> >>If you can open a JIRA and attach your producer / consumer scanner
>>there,
>> >>that would be great.
>> >>
>> >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com>
>>wrote:
>> >>
>> >>> I wrote myself a Scanner wrapper that uses a producer/consumer
>>queue to
>> >>> keep the client fed with a full buffer as much as possible.  When
>> >>>scanning
>> >>> my table with scanner caching at 100 records, I see about a 24%
>>uplift
>> >>>in
>> >>> performance (~35k records/sec with the ClientScanner and ~44k
>> >>>records/sec
>> >>> with my P/C scanner).  However, when I set scanner caching to 5000,
>> >>>it's
>> >>> more of a wash compared to the standard ClientScanner: ~53k
>>records/sec
>> >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>> >>>
>> >>> I'm not sure what to make of those results.  I think next I'll shut
>> >>>down
>> >>> HBase and read the HFiles directly, to see if there's a drop off in
>> >>> performance between reading them directly vs. via the RegionServer.
>> >>>
>> >>> I still think that to really solve this there needs to be sliding
>> >>>window
>> >>> of records in flight between disk and RS, and between RS and client.
>> >>>I'm
>> >>> thinking there's probably a single batch of records in flight
>>between
>> >>>RS
>> >>> and client at the moment.
>> >>>
>> >>> Sandy
>> >>>
>> >>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>> >>>
>> >>> >I am considering scanning a snapshot instead of the table. I
>>believe
>> >>>this
>> >>> >is what the ExportSnapshot class does. If I could use the scanning
>> >>>code
>> >>> >from ExportSnapshot then I will be able to scan the HDFS files
>> >>>directly
>> >>> >and bypass the regionservers. This could potentially give me a huge
>> >>>boost
>> >>> >in performance for full table scans. However, it doesn't really
>> >>>address
>> >>> >the poor scan performance against a table.
>> >>>
>> >>>
>> >
>>
>>


Re: Poor HBase map-reduce scan performance

Posted by yonghu <yo...@gmail.com>.
Can anyone explain why the client + RPC + server path decreases scan
performance? I mean, the RegionServer and TaskTracker run on the same node
when you use MapReduce to scan an HBase table, so, in my understanding,
there should be no RPC cost.

Thanks!

Yong


On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <pr...@adobe.com> wrote:

> https://issues.apache.org/jira/browse/HBASE-8691
>
>
> On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:
>
> >Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
> >with an update in the meantime.
> >
> >I tried a number of different approaches to eliminate latency and
> >"bubbles" in the scan pipeline, and eventually arrived at adding a
> >streaming scan API to the region server, along with refactoring the scan
> >interface into an event-drive message receiver interface.  In so doing, I
> >was able to take scan speed on my cluster from 59,537 records/sec with the
> >classic scanner to 222,703 records per second with my new scan API.
> >Needless to say, I'm pleased ;)
> >
> >More details forthcoming when I get a chance.
> >
> >Thanks,
> >Sandy
> >
> >On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
> >
> >>Thanks for the update, Sandy.
> >>
> >>If you can open a JIRA and attach your producer / consumer scanner there,
> >>that would be great.
> >>
> >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com> wrote:
> >>
> >>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
> >>> keep the client fed with a full buffer as much as possible.  When
> >>>scanning
> >>> my table with scanner caching at 100 records, I see about a 24% uplift
> >>>in
> >>> performance (~35k records/sec with the ClientScanner and ~44k
> >>>records/sec
> >>> with my P/C scanner).  However, when I set scanner caching to 5000,
> >>>it's
> >>> more of a wash compared to the standard ClientScanner: ~53k records/sec
> >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> >>>
> >>> I'm not sure what to make of those results.  I think next I'll shut
> >>>down
> >>> HBase and read the HFiles directly, to see if there's a drop off in
> >>> performance between reading them directly vs. via the RegionServer.
> >>>
> >>> I still think that to really solve this there needs to be sliding
> >>>window
> >>> of records in flight between disk and RS, and between RS and client.
> >>>I'm
> >>> thinking there's probably a single batch of records in flight between
> >>>RS
> >>> and client at the moment.
> >>>
> >>> Sandy
> >>>
> >>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
> >>>
> >>> >I am considering scanning a snapshot instead of the table. I believe
> >>>this
> >>> >is what the ExportSnapshot class does. If I could use the scanning
> >>>code
> >>> >from ExportSnapshot then I will be able to scan the HDFS files
> >>>directly
> >>> >and bypass the regionservers. This could potentially give me a huge
> >>>boost
> >>> >in performance for full table scans. However, it doesn't really
> >>>address
> >>> >the poor scan performance against a table.
> >>>
> >>>
> >
>
>

Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
https://issues.apache.org/jira/browse/HBASE-8691


On 6/4/13 6:11 PM, "Sandy Pratt" <pr...@adobe.com> wrote:

>Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>with an update in the meantime.
>
>I tried a number of different approaches to eliminate latency and
>"bubbles" in the scan pipeline, and eventually arrived at adding a
>streaming scan API to the region server, along with refactoring the scan
>interface into an event-drive message receiver interface.  In so doing, I
>was able to take scan speed on my cluster from 59,537 records/sec with the
>classic scanner to 222,703 records per second with my new scan API.
>Needless to say, I'm pleased ;)
>
>More details forthcoming when I get a chance.
>
>Thanks,
>Sandy
>
>On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
>>Thanks for the update, Sandy.
>>
>>If you can open a JIRA and attach your producer / consumer scanner there,
>>that would be great.
>>
>>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com> wrote:
>>
>>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>>> keep the client fed with a full buffer as much as possible.  When
>>>scanning
>>> my table with scanner caching at 100 records, I see about a 24% uplift
>>>in
>>> performance (~35k records/sec with the ClientScanner and ~44k
>>>records/sec
>>> with my P/C scanner).  However, when I set scanner caching to 5000,
>>>it's
>>> more of a wash compared to the standard ClientScanner: ~53k records/sec
>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>>
>>> I'm not sure what to make of those results.  I think next I'll shut
>>>down
>>> HBase and read the HFiles directly, to see if there's a drop off in
>>> performance between reading them directly vs. via the RegionServer.
>>>
>>> I still think that to really solve this there needs to be sliding
>>>window
>>> of records in flight between disk and RS, and between RS and client.
>>>I'm
>>> thinking there's probably a single batch of records in flight between
>>>RS
>>> and client at the moment.
>>>
>>> Sandy
>>>
>>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>>
>>> >I am considering scanning a snapshot instead of the table. I believe
>>>this
>>> >is what the ExportSnapshot class does. If I could use the scanning
>>>code
>>> >from ExportSnapshot then I will be able to scan the HDFS files
>>>directly
>>> >and bypass the regionservers. This could potentially give me a huge
>>>boost
>>> >in performance for full table scans. However, it doesn't really
>>>address
>>> >the poor scan performance against a table.
>>>
>>>
>


Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
with an update in the meantime.

I tried a number of different approaches to eliminate latency and
"bubbles" in the scan pipeline, and eventually arrived at adding a
streaming scan API to the region server, along with refactoring the scan
interface into an event-driven message receiver interface.  In so doing, I
was able to take scan speed on my cluster from 59,537 records/sec with the
classic scanner to 222,703 records/sec with my new scan API.
Needless to say, I'm pleased ;)

More details forthcoming when I get a chance.

Thanks,
Sandy
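
Sandy's patch isn't posted yet, so the interface below is purely illustrative of what a push-style, event-driven receiver might look like in contrast to the pull-style next() loop; none of these names come from the actual patch.

import org.apache.hadoop.hbase.client.Result;

// Hypothetical shape only, not the API from the patch described above.
public interface ScanReceiver {
  void onResult(Result result);   // invoked as rows stream in from the region server
  void onScanComplete();          // no more rows for this scan
  void onError(Throwable cause);  // scan aborted
}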

On 5/23/13 3:47 PM, "Ted Yu" <yu...@gmail.com> wrote:

>Thanks for the update, Sandy.
>
>If you can open a JIRA and attach your producer / consumer scanner there,
>that would be great.
>
>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com> wrote:
>
>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>> keep the client fed with a full buffer as much as possible.  When
>>scanning
>> my table with scanner caching at 100 records, I see about a 24% uplift
>>in
>> performance (~35k records/sec with the ClientScanner and ~44k
>>records/sec
>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
>> more of a wash compared to the standard ClientScanner: ~53k records/sec
>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>
>> I'm not sure what to make of those results.  I think next I'll shut down
>> HBase and read the HFiles directly, to see if there's a drop off in
>> performance between reading them directly vs. via the RegionServer.
>>
>> I still think that to really solve this there needs to be sliding window
>> of records in flight between disk and RS, and between RS and client.
>>I'm
>> thinking there's probably a single batch of records in flight between RS
>> and client at the moment.
>>
>> Sandy
>>
>> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>
>> >I am considering scanning a snapshot instead of the table. I believe
>>this
>> >is what the ExportSnapshot class does. If I could use the scanning code
>> >from ExportSnapshot then I will be able to scan the HDFS files directly
>> >and bypass the regionservers. This could potentially give me a huge
>>boost
>> >in performance for full table scans. However, it doesn't really address
>> >the poor scan performance against a table.
>>
>>


Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
Thanks for the update, Sandy.

If you can open a JIRA and attach your producer / consumer scanner there,
that would be great.

On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <pr...@adobe.com> wrote:

> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
> keep the client fed with a full buffer as much as possible.  When scanning
> my table with scanner caching at 100 records, I see about a 24% uplift in
> performance (~35k records/sec with the ClientScanner and ~44k records/sec
> with my P/C scanner).  However, when I set scanner caching to 5000, it's
> more of a wash compared to the standard ClientScanner: ~53k records/sec
> with the ClientScanner and ~60k records/sec with the P/C scanner.
>
> I'm not sure what to make of those results.  I think next I'll shut down
> HBase and read the HFiles directly, to see if there's a drop off in
> performance between reading them directly vs. via the RegionServer.
>
> I still think that to really solve this there needs to be sliding window
> of records in flight between disk and RS, and between RS and client.  I'm
> thinking there's probably a single batch of records in flight between RS
> and client at the moment.
>
> Sandy
>
> On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:
>
> >I am considering scanning a snapshot instead of the table. I believe this
> >is what the ExportSnapshot class does. If I could use the scanning code
> >from ExportSnapshot then I will be able to scan the HDFS files directly
> >and bypass the regionservers. This could potentially give me a huge boost
> >in performance for full table scans. However, it doesn't really address
> >the poor scan performance against a table.
>
>

Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
I wrote myself a Scanner wrapper that uses a producer/consumer queue to
keep the client fed with a full buffer as much as possible.  When scanning
my table with scanner caching at 100 records, I see about a 24% uplift in
performance (~35k records/sec with the ClientScanner and ~44k records/sec
with my P/C scanner).  However, when I set scanner caching to 5000, it's
more of a wash compared to the standard ClientScanner: ~53k records/sec
with the ClientScanner and ~60k records/sec with the P/C scanner.

I'm not sure what to make of those results.  I think next I'll shut down
HBase and read the HFiles directly, to see if there's a drop off in
performance between reading them directly vs. via the RegionServer.

I still think that to really solve this there needs to be a sliding window
of records in flight between disk and RS, and between RS and client.  I'm
thinking there's probably a single batch of records in flight between RS
and client at the moment.

Sandy
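
A minimal sketch of a wrapper in the same spirit (this is not Sandy's actual code): a background thread keeps calling next() on the underlying ResultScanner and parks the rows in a bounded queue, so the consumer only blocks when the producer can't keep up.

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;

public class PrefetchingScanner {
  // End-of-scan sentinel; never handed to callers as data.
  private static final Result END_MARKER = new Result();

  private final BlockingQueue<Result> queue;
  private final Thread producer;
  private volatile IOException error;

  public PrefetchingScanner(final ResultScanner scanner, int bufferSize) {
    this.queue = new LinkedBlockingQueue<Result>(bufferSize);
    this.producer = new Thread(new Runnable() {
      public void run() {
        try {
          Result r;
          while ((r = scanner.next()) != null) {
            queue.put(r);                     // blocks only when the buffer is full
          }
          queue.put(END_MARKER);
        } catch (InterruptedException ie) {
          // close() was called; just stop
        } catch (IOException ioe) {
          error = ioe;
          try { queue.put(END_MARKER); } catch (InterruptedException ignored) { }
        } finally {
          scanner.close();
        }
      }
    }, "scan-prefetch");
    this.producer.setDaemon(true);
    this.producer.start();
  }

  /** Returns the next Result, or null when the scan is exhausted. */
  public Result next() throws IOException, InterruptedException {
    Result r = queue.take();
    if (r == END_MARKER) {
      if (error != null) throw error;
      return null;
    }
    return r;
  }

  public void close() {
    producer.interrupt();
  }
}

How much a wrapper like this buys depends on how the per-row processing time compares to the next() round trips, which is consistent with the numbers above: a clear win at caching=100, closer to a wash at 5000.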

On 5/23/13 8:45 AM, "Bryan Keller" <br...@gmail.com> wrote:

>I am considering scanning a snapshot instead of the table. I believe this
>is what the ExportSnapshot class does. If I could use the scanning code
>from ExportSnapshot then I will be able to scan the HDFS files directly
>and bypass the regionservers. This could potentially give me a huge boost
>in performance for full table scans. However, it doesn't really address
>the poor scan performance against a table.


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I am considering scanning a snapshot instead of the table. I believe this is what the ExportSnapshot class does. If I could use the scanning code from ExportSnapshot, then I would be able to scan the HDFS files directly and bypass the regionservers. This could potentially give me a huge boost in performance for full table scans. However, it doesn't really address the poor scan performance against a live table.
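
As a side note, snapshots are file references rather than copies of the data, so taking one right before an export or scan run is cheap. A minimal sketch, assuming the snapshot API that shipped with 0.94.6 (table and snapshot names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TakeSnapshot {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Creates a named snapshot of the table; the snapshot's files can then
      // be read directly from HDFS (e.g. by ExportSnapshot) without going
      // through the regionservers.
      admin.snapshot("mytable_snap", "mytable");
    } finally {
      admin.close();
    }
  }
}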

On May 22, 2013, at 3:57 PM, Ted Yu <yu...@gmail.com> wrote:

> Sandy:
> Looking at patch v6 of HBASE-8420, I think it is different from your
> approach below for the case of cache.size() == 0.
> 
> Maybe log a JIRA for further discussion ?
> 
> On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <pr...@adobe.com> wrote:
> 
>> It seems to be in the ballpark of what I was getting at, but I haven't
>> fully digested the code yet, so I can't say for sure.
>> 
>> Here's what I'm getting at.  Looking at
>> o.a.h.h.client.ClientScanner.next() in the 94.2 source I have loaded, I
>> see there are three branches with respect to the cache:
>> 
>> public Result next() throws IOException {
>> 
>> 
>>  // If the scanner is closed and there's nothing left in the cache, next
>> is a no-op.
>>  if (cache.size() == 0 && this.closed) {
>>    return null;
>>  }
>> 
>>  if (cache.size() == 0) {
>> // Request more results from RS
>>  ...
>>  }
>> 
>>  if (cache.size() > 0) {
>>    return cache.poll();
>>  }
>> 
>>  ...
>>  return null;
>> 
>> }
>> 
>> 
>> I think that middle branch wants to change as follows (pseudo-code):
>> 
>> if the cache size is below a certain threshold then
>>  initiate asynchronous action to refill it
>>  if there is no result to return until the cache refill completes then
>>    block
>>  done
>> done
>> 
>> Or something along those lines.  I haven't grokked the patch well enough
>> yet to tell if that's what it does.  What I think is happening in the
>> 0.94.2 code I've got is that it requests nothing until the cache is empty,
>> then blocks until it's non-empty.  We want to eagerly and asynchronously
>> refill the cache so that we ideally never have to block.
>> 
>> 
>> Sandy
>> 
>> 
>> On 5/22/13 1:39 PM, "Ted Yu" <yu...@gmail.com> wrote:
>> 
>>> Sandy:
>>> Do you think the following JIRA would help with what you expect in this
>>> regard ?
>>> 
>>> HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
>>> 
>>> Cheers
>>> 
>>> On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <pr...@adobe.com> wrote:
>>> 
>>>> I found this thread on search-hadoop.com just now because I've been
>>>> wrestling with the same issue for a while and have as yet been unable to
>>>> solve it.  However, I think I have an idea of the problem.  My theory is
>>>> based on assumptions about what's going on in HBase and HDFS internally,
>>>> so please correct me if I'm wrong.
>>>> 
>>>> Briefly, I think the issue is that sequential reads from HDFS are
>>>> pipelined, whereas sequential reads from HBase are not.  Therefore,
>>>> sequential reads from HDFS tend to keep the IO subsystem saturated,
>>>> while
>>>> sequential reads from HBase allow it to idle for a relatively large
>>>> proportion of time.
>>>> 
>>>> To make this more concrete, suppose that I'm reading N bytes of data
>>>> from
>>>> a file in HDFS.  I issue the calls to open the file and begin to read
>>>> (from an InputStream, for example).  As I'm reading byte 1 of the stream
>>>> at my client, the datanode is reading byte M where 1 < M <= N from disk.
>>>> Thus, three activities tend to happen concurrently for the most part
>>>> (disregarding the beginning and end of the file): 1) processing at the
>>>> client; 2) streaming over the network from datanode to client; and 3)
>>>> reading data from disk at the datanode.  The proportion of time these
>>>> three activities overlap tends towards 100% as N -> infinity.
>>>> 
>>>> Now suppose I read a batch of R records from HBase (where R = whatever
>>>> scanner caching happens to be).  As I understand it, I issue my call to
>>>> ResultScanner.next(), and this causes the RegionServer to block as if
>>>> on a
>>>> page fault while it loads enough HFile blocks from disk to cover the R
>>>> records I (implicitly) requested.  After the blocks are loaded into the
>>>> block cache on the RS, the RS returns R records to me over the network.
>>>> Then I process the R records locally.  When they are exhausted, this
>>>> cycle
>>>> repeats.  The notable upshot is that while the RS is faulting HFile
>>>> blocks
>>>> into the cache, my client is blocked.  Furthermore, while my client is
>>>> processing records, the RS is idle with respect to work on behalf of my
>>>> client.
>>>> 
>>>> That last point is really the killer, if I'm correct in my assumptions.
>>>> It means that Scanner caching and larger block sizes work only to
>>>> amortize
>>>> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the
>>>> IO
>>>> subsystems saturated during sequential reads.  What *should* happen is
>>>> that the RS should treat the Scanner caching value (R above) as a hint
>>>> that it should always have ready records r + 1 to r + R when I'm reading
>>>> record r, at least up to the region boundary.  The RS should be
>>>> preparing
>>>> eagerly for the next call to ResultScanner.next(), which I suspect it's
>>>> currently not doing.
>>>> 
>>>> Another way to state this would be to say that the client should tell
>>>> the
>>>> RS to prepare the next batch of records soon enough that they can start
>>>> to
>>>> arrive at the client just as the client is finishing the current batch.
>>>> As is, I suspect it doesn't request more from the RS until the local
>>>> batch
>>>> is exhausted.
>>>> 
>>>> As I cautioned before, this is based on assumptions about how the
>>>> internals work, so please correct me if I'm wrong.  Also, I'm way behind
>>>> on the mailing list, so I probably won't see any responses unless CC'd
>>>> directly.
>>>> 
>>>> Sandy
>>>> 
>>>> On 5/10/13 8:46 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>>> 
>>>>> FYI, I ran tests with compression on and off.
>>>>> 
>>>>> With a plain HDFS sequence file and compression off, I am getting very
>>>>> good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>>>>> compression on with a sequence file, I/O speed is about 3x slower.
>>>>> However the file size is 3x smaller so it takes about the same time to
>>>>> scan.
>>>>> 
>>>>> With HBase, the results are equivalent (just much slower than a
>>>> sequence
>>>>> file). Scanning a compressed table is about 3x slower I/O than an
>>>>> uncompressed table, but the table is 3x smaller, so the time to scan is
>>>>> about the same. Scanning an HBase table takes about 3x as long as
>>>>> scanning the sequence file export of the table, either compressed or
>>>>> uncompressed. The sequence file export file size ends up being just
>>>>> barely larger than the table, either compressed or uncompressed
>>>>> 
>>>>> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>>>>> the time to scan is about the same. Adding in HBase slows things down
>>>>> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed
>>>> sequence
>>>>> file vs scanning a compressed table.
>>>>> 
>>>>> 
>>>>> On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
>>>>> 
>>>>>> Thanks for the offer Lars! I haven't made much progress speeding
>>>> things
>>>>>> up.
>>>>>> 
>>>>>> I finally put together a test program that populates a table that is
>>>>>> similar to my production dataset. I have a readme that should describe
>>>>>> things, hopefully enough to make it useable. There is code to
>>>> populate a
>>>>>> test table, code to scan the table, and code to scan sequence files
>>>> from
>>>>>> an export (to compare HBase w/ raw HDFS). I use a gradle build script.
>>>>>> 
>>>>>> You can find the code here:
>>>>>> 
>>>>>> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>>>>>> 
>>>>>> 
>>>>>> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
>>>>>> 
>>>>>>> The blockbuffers are not reused, but that by itself should not be a
>>>>>>> problem as they are all the same size (at least I have never
>>>> identified
>>>>>>> that as one in my profiling sessions).
>>>>>>> 
>>>>>>> My offer still stands to do some profiling myself if there is an
>>>> easy
>>>>>>> way to generate data of similar shape.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>> To: user@hbase.apache.org
>>>>>>> Sent: Friday, May 3, 2013 3:44 AM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>> 
>>>>>>> 
>>>>>>> Actually I'm not too confident in my results re block size, they may
>>>>>>> have been related to major compaction. I'm going to rerun before
>>>>>>> drawing any conclusions.
>>>>>>> 
>>>>>>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> I finally made some progress. I tried a very large HBase block size
>>>>>>>> (16mb), and it significantly improved scan performance. I went from
>>>>>>>> 45-50 min to 24 min. Not great but much better. Before I had it set
>>>> to
>>>>>>>> 128k. Scanning an equivalent sequence file takes 10 min. My random
>>>>>>>> read performance will probably suffer with such a large block size
>>>>>>>> (theoretically), so I probably can't keep it this big. I care about
>>>>>>>> random read performance too. I've read having a block size this big
>>>> is
>>>>>>>> not recommended, is that correct?
>>>>>>>> 
>>>>>>>> I haven't dug too deeply into the code, are the block buffers
>>>> reused
>>>>>>>> or is each new block read a new allocation? Perhaps a buffer pool
>>>>>>>> could help here if there isn't one already. When doing a scan, HBase
>>>>>>>> could reuse previously allocated block buffers instead of
>>>> allocating a
>>>>>>>> new one for each block. Then block size shouldn't affect scan
>>>>>>>> performance much.
>>>>>>>> 
>>>>>>>> I'm not using a block encoder. Also, I'm still sifting through the
>>>>>>>> profiler results, I'll see if I can make more sense of it and run
>>>> some
>>>>>>>> more experiments.
>>>>>>>> 
>>>>>>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org>
>> wrote:
>>>>>>>> 
>>>>>>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>>>>>>>>> changed that much from 0.94.4)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
>>>>>>>>> so, try without. They currently need to reallocate a ByteBuffer for
>>>>>>>>> each single KV.
>>>>>>>>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably
>>>>>>>>> have not enabled encoding, just checking).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> And do you have a stack trace for the ByteBuffer.allocate(). That
>>>> is
>>>>>>>>> a strange one since it never came up in my profiling (unless you
>>>>>>>>> enabled block encoding).
>>>>>>>>> (You can get traces from VisualVM by creating a snapshot, but
>>>> you'd
>>>>>>>>> have to drill in to find the allocate()).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> During normal scanning (again, without encoding) there should be
>>>> no
>>>>>>>>> allocation happening except for blocks read from disk (and they
>>>>>>>>> should all be the same size, thus allocation should be cheap).
>>>>>>>>> 
>>>>>>>>> -- Lars
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ________________________________
>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I ran one of my regionservers through VisualVM. It looks like the
>>>>>>>>> top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>>>>>>>>> ByteBuffer.allocate(). It appears at first glance that memory
>>>>>>>>> allocations may be an issue. Decompression was next below that but
>>>>>>>>> less of an issue it seems.
>>>>>>>>> 
>>>>>>>>> Would changing the block size, either HDFS or HBase, help here?
>>>>>>>>> 
>>>>>>>>> Also, if anyone has tips on how else to profile, that would be
>>>>>>>>> appreciated. VisualVM can produce a lot of noise that is hard to
>>>> sift
>>>>>>>>> through.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>>>>>>> 
>>>>>>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org>
>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the
>>>> latest
>>>>>>>>>>> 0.94.7.
>>>>>>>>>>> I would be very curious to see profiling data.
>>>>>>>>>>> 
>>>>>>>>>>> -- Lars
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>>>>>>>> Cc:
>>>>>>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>>> 
>>>>>>>>>>> I tried running my test with 0.94.4, unfortunately performance
>>>> was
>>>>>>>>>>> about the same. I'm planning on profiling the regionserver and
>>>>>>>>>>> trying some other things tonight and tomorrow and will report
>>>> back.
>>>>>>>>>>> 
>>>>>>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com>
>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes I would like to try this, if you can point me to the
>>>> pom.xml
>>>>>>>>>>>> patch that would save me some time.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
>>>>>>>>>>>> amount of bytes copied around in RAM during scanning, especially
>>>>>>>>>>>> if you have wide rows and/or large key portions. That in turns
>>>>>>>>>>>> makes scans scale better across cores, since RAM is shared
>>>>>>>>>>>> resource between cores (much like disk).
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> It's not hard to build the latest HBase against Cloudera's
>>>>>>>>>>>> version of Hadoop. I can send along a simple patch to pom.xml to
>>>>>>>>>>>> do that.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- Lars
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> ________________________________
>>>>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> The table has hashed keys so rows are evenly distributed
>>>> amongst
>>>>>>>>>>>> the regionservers, and load on each regionserver is pretty much
>>>>>>>>>>>> the same. I also have per-table balancing turned on. I get
>>>> mostly
>>>>>>>>>>>> data local mappers with only a few rack local (maybe 10 of the
>>>> 250
>>>>>>>>>>>> mappers).
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently the table is a wide table schema, with lists of data
>>>>>>>>>>>> structures stored as columns with column prefixes grouping the
>>>>>>>>>>>> data structures (e.g. 1_name, 1_address, 1_city, 2_name,
>>>>>>>>>>>> 2_address, 2_city). I was thinking of moving those data
>>>> structures
>>>>>>>>>>>> to protobuf which would cut down on the number of columns. The
>>>>>>>>>>>> downside is I can't filter on one value with that, but it is a
>>>>>>>>>>>> tradeoff I would make for performance. I was also considering
>>>>>>>>>>>> restructuring the table into a tall table.
>>>>>>>>>>>> 
>>>>>>>>>>>> Something interesting is that my old regionserver machines had
>>>>>>>>>>>> five 15k SCSI drives instead of 2 SSDs, and performance was
>>>> about
>>>>>>>>>>>> the same. Also, my old network was 1gbit, now it is 10gbit. So
>>>>>>>>>>>> neither network nor disk I/O appear to be the bottleneck. The
>>>> CPU
>>>>>>>>>>>> is rather high for the regionserver so it seems like the best
>>>>>>>>>>>> candidate to investigate. I will try profiling it tomorrow and
>>>>>>>>>>>> will report back. I may revisit compression on vs off since that
>>>>>>>>>>>> is adding load to the CPU.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'll also come up with a sample program that generates data
>>>>>>>>>>>> similar to my table.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Your average row is 35k so scanner caching would not make a
>>>> huge
>>>>>>>>>>>>> difference, although I would have expected some improvements by
>>>>>>>>>>>>> setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I assume your table is split sufficiently to touch all
>>>>>>>>>>>>> RegionServer... Do you see the same load/IO on all region
>>>> servers?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>>>>>>> I blogged about some of these changes here:
>>>>>>>>>>>>> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In your case - since you have many columns, each of which
>>>> carry
>>>>>>>>>>>>> the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In the end HBase *is* slower than straight HDFS for full
>>>> scans.
>>>>>>>>>>>>> How could it not be?
>>>>>>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's
>>>> is
>>>>>>>>>>>>> disbaled in both HBase and HDFS.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe
>>>> Andy
>>>>>>>>>>>>> Purtell is listening, I think he did some tests with HBase on
>>>>>>>>>>>>> SSDs.
>>>>>>>>>>>>> With rotating media you typically see an improvement with
>>>>>>>>>>>>> compression. With SSDs the added CPU needed for decompression
>>>>>>>>>>>>> might outweigh the benefits.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> At the risk of starting a larger discussion here, I would
>>>> posit
>>>>>>>>>>>>> that HBase's LSM based design, which trades random IO with
>>>>>>>>>>>>> sequential IO, might be a bit more questionable on SSDs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If you can, it would be nice to run a profiler against one of
>>>>>>>>>>>>> the RegionServers (or maybe do it with the single RS setup) and
>>>>>>>>>>>>> see where it is bottlenecked.
>>>>>>>>>>>>> (And if you send me a sample program to generate some data -
>>>> not
>>>>>>>>>>>>> 700g, though :) - I'll try to do a bit of profiling during the
>>>>>>>>>>>>> next days as my day job permits, but I do not have any machines
>>>>>>>>>>>>> with SSDs).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- Lars
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>>>>>>>>>>>>> setCacheBlocks(false)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com>
>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan,
>>>> which
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>> be bad for MapReduce jobs
>>>>>>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I guess you have used the above setting.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>>>>>>>>>>>>>> to, say
>>>>>>>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
Sandy:
Looking at patch v6 of HBASE-8420, I think it is different from your
approach below for the case of cache.size() == 0.

Maybe log a JIRA for further discussion?

On Wed, May 22, 2013 at 3:33 PM, Sandy Pratt <pr...@adobe.com> wrote:

> It seems to be in the ballpark of what I was getting at, but I haven't
> fully digested the code yet, so I can't say for sure.
>
> Here's what I'm getting at.  Looking at
> o.a.h.h.client.ClientScanner.next() in the 94.2 source I have loaded, I
> see there are three branches with respect to the cache:
>
> public Result next() throws IOException {
>
>
>   // If the scanner is closed and there's nothing left in the cache, next
> is a no-op.
>   if (cache.size() == 0 && this.closed) {
>     return null;
>   }
>
>   if (cache.size() == 0) {
> // Request more results from RS
>   ...
>   }
>
>   if (cache.size() > 0) {
>     return cache.poll();
>   }
>
>   ...
>   return null;
>
> }
>
>
> I think that middle branch wants to change as follows (pseudo-code):
>
> if the cache size is below a certain threshold then
>   initiate asynchronous action to refill it
>   if there is no result to return until the cache refill completes then
>     block
>   done
> done
>
> Or something along those lines.  I haven't grokked the patch well enough
> yet to tell if that's what it does.  What I think is happening in the
> 0.94.2 code I've got is that it requests nothing until the cache is empty,
> then blocks until it's non-empty.  We want to eagerly and asynchronously
> refill the cache so that we ideally never have to block.
>
>
> Sandy
>
>
> On 5/22/13 1:39 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
> >Sandy:
> >Do you think the following JIRA would help with what you expect in this
> >regard ?
> >
> >HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
> >
> >Cheers
> >
> >On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <pr...@adobe.com> wrote:
> >
> >> I found this thread on search-hadoop.com just now because I've been
> >> wrestling with the same issue for a while and have as yet been unable to
> >> solve it.  However, I think I have an idea of the problem.  My theory is
> >> based on assumptions about what's going on in HBase and HDFS internally,
> >> so please correct me if I'm wrong.
> >>
> >> Briefly, I think the issue is that sequential reads from HDFS are
> >> pipelined, whereas sequential reads from HBase are not.  Therefore,
> >> sequential reads from HDFS tend to keep the IO subsystem saturated,
> >>while
> >> sequential reads from HBase allow it to idle for a relatively large
> >> proportion of time.
> >>
> >> To make this more concrete, suppose that I'm reading N bytes of data
> >>from
> >> a file in HDFS.  I issue the calls to open the file and begin to read
> >> (from an InputStream, for example).  As I'm reading byte 1 of the stream
> >> at my client, the datanode is reading byte M where 1 < M <= N from disk.
> >> Thus, three activities tend to happen concurrently for the most part
> >> (disregarding the beginning and end of the file): 1) processing at the
> >> client; 2) streaming over the network from datanode to client; and 3)
> >> reading data from disk at the datanode.  The proportion of time these
> >> three activities overlap tends towards 100% as N -> infinity.
> >>
> >> Now suppose I read a batch of R records from HBase (where R = whatever
> >> scanner caching happens to be).  As I understand it, I issue my call to
> >> ResultScanner.next(), and this causes the RegionServer to block as if
> >>on a
> >> page fault while it loads enough HFile blocks from disk to cover the R
> >> records I (implicitly) requested.  After the blocks are loaded into the
> >> block cache on the RS, the RS returns R records to me over the network.
> >> Then I process the R records locally.  When they are exhausted, this
> >>cycle
> >> repeats.  The notable upshot is that while the RS is faulting HFile
> >>blocks
> >> into the cache, my client is blocked.  Furthermore, while my client is
> >> processing records, the RS is idle with respect to work on behalf of my
> >> client.
> >>
> >> That last point is really the killer, if I'm correct in my assumptions.
> >> It means that Scanner caching and larger block sizes work only to
> >>amortize
> >> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the
> >>IO
> >> subsystems saturated during sequential reads.  What *should* happen is
> >> that the RS should treat the Scanner caching value (R above) as a hint
> >> that it should always have ready records r + 1 to r + R when I'm reading
> >> record r, at least up to the region boundary.  The RS should be
> >>preparing
> >> eagerly for the next call to ResultScanner.next(), which I suspect it's
> >> currently not doing.
> >>
> >> Another way to state this would be to say that the client should tell
> >>the
> >> RS to prepare the next batch of records soon enough that they can start
> >>to
> >> arrive at the client just as the client is finishing the current batch.
> >> As is, I suspect it doesn't request more from the RS until the local
> >>batch
> >> is exhausted.
> >>
> >> As I cautioned before, this is based on assumptions about how the
> >> internals work, so please correct me if I'm wrong.  Also, I'm way behind
> >> on the mailing list, so I probably won't see any responses unless CC'd
> >> directly.
> >>
> >> Sandy
> >>
> >> On 5/10/13 8:46 AM, "Bryan Keller" <br...@gmail.com> wrote:
> >>
> >> >FYI, I ran tests with compression on and off.
> >> >
> >> >With a plain HDFS sequence file and compression off, I am getting very
> >> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> >> >compression on with a sequence file, I/O speed is about 3x slower.
> >> >However the file size is 3x smaller so it takes about the same time to
> >> >scan.
> >> >
> >> >With HBase, the results are equivalent (just much slower than a
> >>sequence
> >> >file). Scanning a compressed table is about 3x slower I/O than an
> >> >uncompressed table, but the table is 3x smaller, so the time to scan is
> >> >about the same. Scanning an HBase table takes about 3x as long as
> >> >scanning the sequence file export of the table, either compressed or
> >> >uncompressed. The sequence file export file size ends up being just
> >> >barely larger than the table, either compressed or uncompressed
> >> >
> >> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> >> >the time to scan is about the same. Adding in HBase slows things down
> >> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed
> >>sequence
> >> >file vs scanning a compressed table.
> >> >
> >> >
> >> >On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
> >> >
> >> >> Thanks for the offer Lars! I haven't made much progress speeding
> >>things
> >> >>up.
> >> >>
> >> >> I finally put together a test program that populates a table that is
> >> >>similar to my production dataset. I have a readme that should describe
> >> >>things, hopefully enough to make it useable. There is code to
> >>populate a
> >> >>test table, code to scan the table, and code to scan sequence files
> >>from
> >> >>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >> >>
> >> >> You can find the code here:
> >> >>
> >> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >> >>
> >> >>
> >> >> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
> >> >>
> >> >>> The blockbuffers are not reused, but that by itself should not be a
> >> >>>problem as they are all the same size (at least I have never
> >>identified
> >> >>>that as one in my profiling sessions).
> >> >>>
> >> >>> My offer still stands to do some profiling myself if there is an
> >>easy
> >> >>>way to generate data of similar shape.
> >> >>>
> >> >>> -- Lars
> >> >>>
> >> >>>
> >> >>>
> >> >>> ________________________________
> >> >>> From: Bryan Keller <br...@gmail.com>
> >> >>> To: user@hbase.apache.org
> >> >>> Sent: Friday, May 3, 2013 3:44 AM
> >> >>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>
> >> >>>
> >> >>> Actually I'm not too confident in my results re block size, they may
> >> >>>have been related to major compaction. I'm going to rerun before
> >> >>>drawing any conclusions.
> >> >>>
> >> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com>
> wrote:
> >> >>>
> >> >>>> I finally made some progress. I tried a very large HBase block size
> >> >>>>(16mb), and it significantly improved scan performance. I went from
> >> >>>>45-50 min to 24 min. Not great but much better. Before I had it set
> >>to
> >> >>>>128k. Scanning an equivalent sequence file takes 10 min. My random
> >> >>>>read performance will probably suffer with such a large block size
> >> >>>>(theoretically), so I probably can't keep it this big. I care about
> >> >>>>random read performance too. I've read having a block size this big
> >>is
> >> >>>>not recommended, is that correct?
> >> >>>>
> >> >>>> I haven't dug too deeply into the code, are the block buffers
> >>reused
> >> >>>>or is each new block read a new allocation? Perhaps a buffer pool
> >> >>>>could help here if there isn't one already. When doing a scan, HBase
> >> >>>>could reuse previously allocated block buffers instead of
> >>allocating a
> >> >>>>new one for each block. Then block size shouldn't affect scan
> >> >>>>performance much.
> >> >>>>
> >> >>>> I'm not using a block encoder. Also, I'm still sifting through the
> >> >>>>profiler results, I'll see if I can make more sense of it and run
> >>some
> >> >>>>more experiments.
> >> >>>>
> >> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org>
> wrote:
> >> >>>>
> >> >>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
> >> >>>>>changed that much from 0.94.4)
> >> >>>>>
> >> >>>>>
> >> >>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
> >> >>>>>so, try without. They currently need to reallocate a ByteBuffer for
> >> >>>>>each single KV.
> >> >>>>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably
> >> >>>>>have not enabled encoding, just checking).
> >> >>>>>
> >> >>>>>
> >> >>>>> And do you have a stack trace for the ByteBuffer.allocate(). That
> >>is
> >> >>>>>a strange one since it never came up in my profiling (unless you
> >> >>>>>enabled block encoding).
> >> >>>>> (You can get traces from VisualVM by creating a snapshot, but
> >>you'd
> >> >>>>>have to drill in to find the allocate()).
> >> >>>>>
> >> >>>>>
> >> >>>>> During normal scanning (again, without encoding) there should be
> >>no
> >> >>>>>allocation happening except for blocks read from disk (and they
> >> >>>>>should all be the same size, thus allocation should be cheap).
> >> >>>>>
> >> >>>>> -- Lars
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> ________________________________
> >> >>>>> From: Bryan Keller <br...@gmail.com>
> >> >>>>> To: user@hbase.apache.org
> >> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
> >> >>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>
> >> >>>>>
> >> >>>>> I ran one of my regionservers through VisualVM. It looks like the
> >> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> >> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
> >> >>>>>allocations may be an issue. Decompression was next below that but
> >> >>>>>less of an issue it seems.
> >> >>>>>
> >> >>>>> Would changing the block size, either HDFS or HBase, help here?
> >> >>>>>
> >> >>>>> Also, if anyone has tips on how else to profile, that would be
> >> >>>>>appreciated. VisualVM can produce a lot of noise that is hard to
> >>sift
> >> >>>>>through.
> >> >>>>>
> >> >>>>>
> >> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com>
> >>wrote:
> >> >>>>>
> >> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >> >>>>>>
> >> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org>
> >>wrote:
> >> >>>>>>
> >> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the
> >>latest
> >> >>>>>>>0.94.7.
> >> >>>>>>> I would be very curious to see profiling data.
> >> >>>>>>>
> >> >>>>>>> -- Lars
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> ----- Original Message -----
> >> >>>>>>> From: Bryan Keller <br...@gmail.com>
> >> >>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> >>>>>>> Cc:
> >> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>
> >> >>>>>>> I tried running my test with 0.94.4, unfortunately performance
> >>was
> >> >>>>>>>about the same. I'm planning on profiling the regionserver and
> >> >>>>>>>trying some other things tonight and tomorrow and will report
> >>back.
> >> >>>>>>>
> >> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com>
> >> wrote:
> >> >>>>>>>
> >> >>>>>>>> Yes I would like to try this, if you can point me to the
> >>pom.xml
> >> >>>>>>>>patch that would save me some time.
> >> >>>>>>>>
> >> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
> >> >>>>>>>>amount of bytes copied around in RAM during scanning, especially
> >> >>>>>>>>if you have wide rows and/or large key portions. That in turns
> >> >>>>>>>>makes scans scale better across cores, since RAM is shared
> >> >>>>>>>>resource between cores (much like disk).
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
> >> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
> >> >>>>>>>>do that.
> >> >>>>>>>>
> >> >>>>>>>> -- Lars
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> ________________________________
> >> >>>>>>>>  From: Bryan Keller <br...@gmail.com>
> >> >>>>>>>> To: user@hbase.apache.org
> >> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> The table has hashed keys so rows are evenly distributed
> >>amongst
> >> >>>>>>>>the regionservers, and load on each regionserver is pretty much
> >> >>>>>>>>the same. I also have per-table balancing turned on. I get
> >>mostly
> >> >>>>>>>>data local mappers with only a few rack local (maybe 10 of the
> >>250
> >> >>>>>>>>mappers).
> >> >>>>>>>>
> >> >>>>>>>> Currently the table is a wide table schema, with lists of data
> >> >>>>>>>>structures stored as columns with column prefixes grouping the
> >> >>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
> >> >>>>>>>>2_address, 2_city). I was thinking of moving those data
> >>structures
> >> >>>>>>>>to protobuf which would cut down on the number of columns. The
> >> >>>>>>>>downside is I can't filter on one value with that, but it is a
> >> >>>>>>>>tradeoff I would make for performance. I was also considering
> >> >>>>>>>>restructuring the table into a tall table.
> >> >>>>>>>>
> >> >>>>>>>> Something interesting is that my old regionserver machines had
> >> >>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was
> >>about
> >> >>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
> >> >>>>>>>>neither network nor disk I/O appear to be the bottleneck. The
> >>CPU
> >> >>>>>>>>is rather high for the regionserver so it seems like the best
> >> >>>>>>>>candidate to investigate. I will try profiling it tomorrow and
> >> >>>>>>>>will report back. I may revisit compression on vs off since that
> >> >>>>>>>>is adding load to the CPU.
> >> >>>>>>>>
> >> >>>>>>>> I'll also come up with a sample program that generates data
> >> >>>>>>>>similar to my table.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
> >> >>>>>>>>wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Your average row is 35k so scanner caching would not make a
> >>huge
> >> >>>>>>>>>difference, although I would have expected some improvements by
> >> >>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
> >> >>>>>>>>>
> >> >>>>>>>>> I assume your table is split sufficiently to touch all
> >> >>>>>>>>>RegionServer... Do you see the same load/IO on all region
> >>servers?
> >> >>>>>>>>>
> >> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >> >>>>>>>>> I blogged about some of these changes here:
> >> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >> >>>>>>>>>
> >> >>>>>>>>> In your case - since you have many columns, each of which
> >>carry
> >> >>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
> >> >>>>>>>>>
> >> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full
> >>scans.
> >> >>>>>>>>>How could it not be?
> >> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's
> >>is
> >> >>>>>>>>>disbaled in both HBase and HDFS.
> >> >>>>>>>>>
> >> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe
> >>Andy
> >> >>>>>>>>>Purtell is listening, I think he did some tests with HBase on
> >> >>>>>>>>>SSDs.
> >> >>>>>>>>> With rotating media you typically see an improvement with
> >> >>>>>>>>>compression. With SSDs the added CPU needed for decompression
> >> >>>>>>>>>might outweigh the benefits.
> >> >>>>>>>>>
> >> >>>>>>>>> At the risk of starting a larger discussion here, I would
> >>posit
> >> >>>>>>>>>that HBase's LSM based design, which trades random IO with
> >> >>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
> >> >>>>>>>>>
> >> >>>>>>>>> If you can, it would be nice to run a profiler against one of
> >> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
> >> >>>>>>>>>see where it is bottlenecked.
> >> >>>>>>>>> (And if you send me a sample program to generate some data -
> >>not
> >> >>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
> >> >>>>>>>>>next days as my day job permits, but I do not have any machines
> >> >>>>>>>>>with SSDs).
> >> >>>>>>>>>
> >> >>>>>>>>> -- Lars
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> ________________________________
> >> >>>>>>>>> From: Bryan Keller <br...@gmail.com>
> >> >>>>>>>>> To: user@hbase.apache.org
> >> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
> >> >>>>>>>>>setCacheBlocks(false)
> >> >>>>>>>>>
> >> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com>
> >>wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >> >>>>>>>>>>
> >> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan,
> >>which
> >> >>>>>>>>>>will
> >> >>>>>>>>>> be bad for MapReduce jobs
> >> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >> >>>>>>>>>>
> >> >>>>>>>>>> I guess you have used the above setting.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
> >> >>>>>>>>>>to, say
> >> >>>>>>>>>> 0.94.7 which was recently released ?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Cheers
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> >> >>>>>>>
> >> >>
> >> >
> >>
> >>
>
>

Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
It seems to be in the ballpark of what I was getting at, but I haven't
fully digested the code yet, so I can't say for sure.

Here's what I'm getting at.  Looking at
o.a.h.h.client.ClientScanner.next() in the 0.94.2 source I have loaded, I
see there are three branches with respect to the cache:

public Result next() throws IOException {

  // If the scanner is closed and there's nothing left in the cache,
  // next is a no-op.
  if (cache.size() == 0 && this.closed) {
    return null;
  }

  if (cache.size() == 0) {
    // Request more results from RS
    ...
  }

  if (cache.size() > 0) {
    return cache.poll();
  }

  ...
  return null;
}


I think that middle branch wants to change as follows (pseudo-code):

if the cache size is below a certain threshold then
  initiate asynchronous action to refill it
  if there is no result to return until the cache refill completes then
    block
  done
done

Or something along those lines.  I haven't grokked the patch well enough
yet to tell if that's what it does.  What I think is happening in the
0.94.2 code I've got is that it requests nothing until the cache is empty,
then blocks until it's non-empty.  We want to eagerly and asynchronously
refill the cache so that we ideally never have to block.
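
To make that concrete, here is a rough, untested client-side sketch of
the idea (a hypothetical PrefetchingScanner wrapper of my own, not the
actual patch and not the real ClientScanner internals): keep a local
queue of Results, have a background thread top it up once it drops below
a threshold, and only block when the queue is truly empty.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;

// Hypothetical wrapper, not part of HBase: a background thread keeps a
// local queue of Results topped up so that next() rarely has to block.
public class PrefetchingScanner {

  private final ResultScanner scanner;        // the real scanner doing the RPCs
  private final BlockingQueue<Result> queue;  // locally cached Results
  private final int threshold;                // refill when size drops below this
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final AtomicBoolean fetching = new AtomicBoolean(false);
  private volatile boolean exhausted = false;

  public PrefetchingScanner(ResultScanner scanner, int capacity, int threshold) {
    this.scanner = scanner;
    this.queue = new ArrayBlockingQueue<Result>(capacity);
    this.threshold = threshold;
    maybeRefill();  // warm the queue before the first next()
  }

  public Result next() throws InterruptedException {
    while (true) {
      if (!exhausted && queue.size() < threshold) {
        maybeRefill();  // eager, asynchronous refill instead of waiting for empty
      }
      Result r = queue.poll();      // non-blocking
      if (r != null) {
        return r;
      }
      if (exhausted && queue.isEmpty()) {
        return null;                // nothing buffered, nothing left to fetch
      }
      Thread.sleep(1);  // crude wait; a real version would use proper signalling
    }
  }

  private void maybeRefill() {
    if (exhausted || !fetching.compareAndSet(false, true)) {
      return;  // already done, or a refill is in flight
    }
    executor.execute(new Runnable() {
      public void run() {
        try {
          while (queue.remainingCapacity() > 0) {
            Result r = scanner.next();  // the blocking RPC happens off-thread
            if (r == null) {
              exhausted = true;         // scan finished
              break;
            }
            queue.put(r);
          }
        } catch (Exception e) {
          exhausted = true;             // simplistic error handling for a sketch
        } finally {
          fetching.set(false);
        }
      }
    });
  }

  public void close() {
    executor.shutdownNow();
    scanner.close();
  }
}

Used from the scan loop it would be roughly
new PrefetchingScanner(table.getScanner(scan), 2 * caching, caching),
calling next() until it returns null. It only hides the fetch latency
behind the caller's processing time, of course; the real win would be
the RS preparing the next blocks ahead of time as well.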


Sandy


On 5/22/13 1:39 PM, "Ted Yu" <yu...@gmail.com> wrote:

>Sandy:
>Do you think the following JIRA would help with what you expect in this
>regard ?
>
>HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb
>
>Cheers
>
>On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <pr...@adobe.com> wrote:
>
>> I found this thread on search-hadoop.com just now because I've been
>> wrestling with the same issue for a while and have as yet been unable to
>> solve it.  However, I think I have an idea of the problem.  My theory is
>> based on assumptions about what's going on in HBase and HDFS internally,
>> so please correct me if I'm wrong.
>>
>> Briefly, I think the issue is that sequential reads from HDFS are
>> pipelined, whereas sequential reads from HBase are not.  Therefore,
>> sequential reads from HDFS tend to keep the IO subsystem saturated,
>>while
>> sequential reads from HBase allow it to idle for a relatively large
>> proportion of time.
>>
>> To make this more concrete, suppose that I'm reading N bytes of data
>>from
>> a file in HDFS.  I issue the calls to open the file and begin to read
>> (from an InputStream, for example).  As I'm reading byte 1 of the stream
>> at my client, the datanode is reading byte M where 1 < M <= N from disk.
>> Thus, three activities tend to happen concurrently for the most part
>> (disregarding the beginning and end of the file): 1) processing at the
>> client; 2) streaming over the network from datanode to client; and 3)
>> reading data from disk at the datanode.  The proportion of time these
>> three activities overlap tends towards 100% as N -> infinity.
>>
>> Now suppose I read a batch of R records from HBase (where R = whatever
>> scanner caching happens to be).  As I understand it, I issue my call to
>> ResultScanner.next(), and this causes the RegionServer to block as if
>>on a
>> page fault while it loads enough HFile blocks from disk to cover the R
>> records I (implicitly) requested.  After the blocks are loaded into the
>> block cache on the RS, the RS returns R records to me over the network.
>> Then I process the R records locally.  When they are exhausted, this
>>cycle
>> repeats.  The notable upshot is that while the RS is faulting HFile
>>blocks
>> into the cache, my client is blocked.  Furthermore, while my client is
>> processing records, the RS is idle with respect to work on behalf of my
>> client.
>>
>> That last point is really the killer, if I'm correct in my assumptions.
>> It means that Scanner caching and larger block sizes work only to
>>amortize
>> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the
>>IO
>> subsystems saturated during sequential reads.  What *should* happen is
>> that the RS should treat the Scanner caching value (R above) as a hint
>> that it should always have ready records r + 1 to r + R when I'm reading
>> record r, at least up to the region boundary.  The RS should be
>>preparing
>> eagerly for the next call to ResultScanner.next(), which I suspect it's
>> currently not doing.
>>
>> Another way to state this would be to say that the client should tell
>>the
>> RS to prepare the next batch of records soon enough that they can start
>>to
>> arrive at the client just as the client is finishing the current batch.
>> As is, I suspect it doesn't request more from the RS until the local
>>batch
>> is exhausted.
>>
>> As I cautioned before, this is based on assumptions about how the
>> internals work, so please correct me if I'm wrong.  Also, I'm way behind
>> on the mailing list, so I probably won't see any responses unless CC'd
>> directly.
>>
>> Sandy
>>
>> On 5/10/13 8:46 AM, "Bryan Keller" <br...@gmail.com> wrote:
>>
>> >FYI, I ran tests with compression on and off.
>> >
>> >With a plain HDFS sequence file and compression off, I am getting very
>> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>> >compression on with a sequence file, I/O speed is about 3x slower.
>> >However the file size is 3x smaller so it takes about the same time to
>> >scan.
>> >
>> >With HBase, the results are equivalent (just much slower than a
>>sequence
>> >file). Scanning a compressed table is about 3x slower I/O than an
>> >uncompressed table, but the table is 3x smaller, so the time to scan is
>> >about the same. Scanning an HBase table takes about 3x as long as
>> >scanning the sequence file export of the table, either compressed or
>> >uncompressed. The sequence file export file size ends up being just
>> >barely larger than the table, either compressed or uncompressed
>> >
>> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>> >the time to scan is about the same. Adding in HBase slows things down
>> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed
>>sequence
>> >file vs scanning a compressed table.
>> >
>> >
>> >On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
>> >
>> >> Thanks for the offer Lars! I haven't made much progress speeding
>>things
>> >>up.
>> >>
>> >> I finally put together a test program that populates a table that is
>> >>similar to my production dataset. I have a readme that should describe
>> >>things, hopefully enough to make it useable. There is code to
>>populate a
>> >>test table, code to scan the table, and code to scan sequence files
>>from
>> >>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
>> >>
>> >> You can find the code here:
>> >>
>> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>> >>
>> >>
>> >> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
>> >>
>> >>> The blockbuffers are not reused, but that by itself should not be a
>> >>>problem as they are all the same size (at least I have never
>>identified
>> >>>that as one in my profiling sessions).
>> >>>
>> >>> My offer still stands to do some profiling myself if there is an
>>easy
>> >>>way to generate data of similar shape.
>> >>>
>> >>> -- Lars
>> >>>
>> >>>
>> >>>
>> >>> ________________________________
>> >>> From: Bryan Keller <br...@gmail.com>
>> >>> To: user@hbase.apache.org
>> >>> Sent: Friday, May 3, 2013 3:44 AM
>> >>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>
>> >>>
>> >>> Actually I'm not too confident in my results re block size, they may
>> >>>have been related to major compaction. I'm going to rerun before
>> >>>drawing any conclusions.
>> >>>
>> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
>> >>>
>> >>>> I finally made some progress. I tried a very large HBase block size
>> >>>>(16mb), and it significantly improved scan performance. I went from
>> >>>>45-50 min to 24 min. Not great but much better. Before I had it set
>>to
>> >>>>128k. Scanning an equivalent sequence file takes 10 min. My random
>> >>>>read performance will probably suffer with such a large block size
>> >>>>(theoretically), so I probably can't keep it this big. I care about
>> >>>>random read performance too. I've read having a block size this big
>>is
>> >>>>not recommended, is that correct?
>> >>>>
>> >>>> I haven't dug too deeply into the code, are the block buffers
>>reused
>> >>>>or is each new block read a new allocation? Perhaps a buffer pool
>> >>>>could help here if there isn't one already. When doing a scan, HBase
>> >>>>could reuse previously allocated block buffers instead of
>>allocating a
>> >>>>new one for each block. Then block size shouldn't affect scan
>> >>>>performance much.
>> >>>>
>> >>>> I'm not using a block encoder. Also, I'm still sifting through the
>> >>>>profiler results, I'll see if I can make more sense of it and run
>>some
>> >>>>more experiments.
>> >>>>
>> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
>> >>>>
>> >>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>> >>>>>changed that much from 0.94.4)
>> >>>>>
>> >>>>>
>> >>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
>> >>>>>so, try without. They currently need to reallocate a ByteBuffer for
>> >>>>>each single KV.
>> >>>>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably
>> >>>>>have not enabled encoding, just checking).
>> >>>>>
>> >>>>>
>> >>>>> And do you have a stack trace for the ByteBuffer.allocate(). That
>>is
>> >>>>>a strange one since it never came up in my profiling (unless you
>> >>>>>enabled block encoding).
>> >>>>> (You can get traces from VisualVM by creating a snapshot, but
>>you'd
>> >>>>>have to drill in to find the allocate()).
>> >>>>>
>> >>>>>
>> >>>>> During normal scanning (again, without encoding) there should be
>>no
>> >>>>>allocation happening except for blocks read from disk (and they
>> >>>>>should all be the same size, thus allocation should be cheap).
>> >>>>>
>> >>>>> -- Lars
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> ________________________________
>> >>>>> From: Bryan Keller <br...@gmail.com>
>> >>>>> To: user@hbase.apache.org
>> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
>> >>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>
>> >>>>>
>> >>>>> I ran one of my regionservers through VisualVM. It looks like the
>> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
>> >>>>>allocations may be an issue. Decompression was next below that but
>> >>>>>less of an issue it seems.
>> >>>>>
>> >>>>> Would changing the block size, either HDFS or HBase, help here?
>> >>>>>
>> >>>>> Also, if anyone has tips on how else to profile, that would be
>> >>>>>appreciated. VisualVM can produce a lot of noise that is hard to
>>sift
>> >>>>>through.
>> >>>>>
>> >>>>>
>> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com>
>>wrote:
>> >>>>>
>> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>> >>>>>>
>> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org>
>>wrote:
>> >>>>>>
>> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the
>>latest
>> >>>>>>>0.94.7.
>> >>>>>>> I would be very curious to see profiling data.
>> >>>>>>>
>> >>>>>>> -- Lars
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> ----- Original Message -----
>> >>>>>>> From: Bryan Keller <br...@gmail.com>
>> >>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> >>>>>>> Cc:
>> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>>>
>> >>>>>>> I tried running my test with 0.94.4, unfortunately performance
>>was
>> >>>>>>>about the same. I'm planning on profiling the regionserver and
>> >>>>>>>trying some other things tonight and tomorrow and will report
>>back.
>> >>>>>>>
>> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>>> Yes I would like to try this, if you can point me to the
>>pom.xml
>> >>>>>>>>patch that would save me some time.
>> >>>>>>>>
>> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
>> >>>>>>>>amount of bytes copied around in RAM during scanning, especially
>> >>>>>>>>if you have wide rows and/or large key portions. That in turns
>> >>>>>>>>makes scans scale better across cores, since RAM is shared
>> >>>>>>>>resource between cores (much like disk).
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
>> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
>> >>>>>>>>do that.
>> >>>>>>>>
>> >>>>>>>> -- Lars
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> ________________________________
>> >>>>>>>>  From: Bryan Keller <br...@gmail.com>
>> >>>>>>>> To: user@hbase.apache.org
>> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> The table has hashed keys so rows are evenly distributed
>>amongst
>> >>>>>>>>the regionservers, and load on each regionserver is pretty much
>> >>>>>>>>the same. I also have per-table balancing turned on. I get
>>mostly
>> >>>>>>>>data local mappers with only a few rack local (maybe 10 of the
>>250
>> >>>>>>>>mappers).
>> >>>>>>>>
>> >>>>>>>> Currently the table is a wide table schema, with lists of data
>> >>>>>>>>structures stored as columns with column prefixes grouping the
>> >>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
>> >>>>>>>>2_address, 2_city). I was thinking of moving those data
>>structures
>> >>>>>>>>to protobuf which would cut down on the number of columns. The
>> >>>>>>>>downside is I can't filter on one value with that, but it is a
>> >>>>>>>>tradeoff I would make for performance. I was also considering
>> >>>>>>>>restructuring the table into a tall table.
>> >>>>>>>>
>> >>>>>>>> Something interesting is that my old regionserver machines had
>> >>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was
>>about
>> >>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
>> >>>>>>>>neither network nor disk I/O appear to be the bottleneck. The
>>CPU
>> >>>>>>>>is rather high for the regionserver so it seems like the best
>> >>>>>>>>candidate to investigate. I will try profiling it tomorrow and
>> >>>>>>>>will report back. I may revisit compression on vs off since that
>> >>>>>>>>is adding load to the CPU.
>> >>>>>>>>
>> >>>>>>>> I'll also come up with a sample program that generates data
>> >>>>>>>>similar to my table.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
>> >>>>>>>>wrote:
>> >>>>>>>>
>> >>>>>>>>> Your average row is 35k so scanner caching would not make a
>>huge
>> >>>>>>>>>difference, although I would have expected some improvements by
>> >>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
>> >>>>>>>>>
>> >>>>>>>>> I assume your table is split sufficiently to touch all
>> >>>>>>>>>RegionServer... Do you see the same load/IO on all region
>>servers?
>> >>>>>>>>>
>> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>> >>>>>>>>> I blogged about some of these changes here:
>> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>> >>>>>>>>>
>> >>>>>>>>> In your case - since you have many columns, each of which
>>carry
>> >>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
>> >>>>>>>>>
>> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full
>>scans.
>> >>>>>>>>>How could it not be?
>> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's
>>is
>> >>>>>>>>>disbaled in both HBase and HDFS.
>> >>>>>>>>>
>> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe
>>Andy
>> >>>>>>>>>Purtell is listening, I think he did some tests with HBase on
>> >>>>>>>>>SSDs.
>> >>>>>>>>> With rotating media you typically see an improvement with
>> >>>>>>>>>compression. With SSDs the added CPU needed for decompression
>> >>>>>>>>>might outweigh the benefits.
>> >>>>>>>>>
>> >>>>>>>>> At the risk of starting a larger discussion here, I would
>>posit
>> >>>>>>>>>that HBase's LSM based design, which trades random IO with
>> >>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
>> >>>>>>>>>
>> >>>>>>>>> If you can, it would be nice to run a profiler against one of
>> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
>> >>>>>>>>>see where it is bottlenecked.
>> >>>>>>>>> (And if you send me a sample program to generate some data -
>>not
>> >>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
>> >>>>>>>>>next days as my day job permits, but I do not have any machines
>> >>>>>>>>>with SSDs).
>> >>>>>>>>>
>> >>>>>>>>> -- Lars
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> ________________________________
>> >>>>>>>>> From: Bryan Keller <br...@gmail.com>
>> >>>>>>>>> To: user@hbase.apache.org
>> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>> >>>>>>>>>setCacheBlocks(false)
>> >>>>>>>>>
>> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com>
>>wrote:
>> >>>>>>>>>
>> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>> >>>>>>>>>>
>> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan,
>>which
>> >>>>>>>>>>will
>> >>>>>>>>>> be bad for MapReduce jobs
>> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>> >>>>>>>>>>
>> >>>>>>>>>> I guess you have used the above setting.
>> >>>>>>>>>>
>> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>> >>>>>>>>>>to, say
>> >>>>>>>>>> 0.94.7 which was recently released ?
>> >>>>>>>>>>
>> >>>>>>>>>> Cheers
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>> >>>>>>>
>> >>
>> >
>>
>>


Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
Sandy:
Do you think the following JIRA would help with what you expect in this
regard?

HBASE-8420 Port HBASE-6874 Implement prefetching for scanners from 0.89-fb

Cheers

On Wed, May 22, 2013 at 1:29 PM, Sandy Pratt <pr...@adobe.com> wrote:

> I found this thread on search-hadoop.com just now because I've been
> wrestling with the same issue for a while and have as yet been unable to
> solve it.  However, I think I have an idea of the problem.  My theory is
> based on assumptions about what's going on in HBase and HDFS internally,
> so please correct me if I'm wrong.
>
> Briefly, I think the issue is that sequential reads from HDFS are
> pipelined, whereas sequential reads from HBase are not.  Therefore,
> sequential reads from HDFS tend to keep the IO subsystem saturated, while
> sequential reads from HBase allow it to idle for a relatively large
> proportion of time.
>
> To make this more concrete, suppose that I'm reading N bytes of data from
> a file in HDFS.  I issue the calls to open the file and begin to read
> (from an InputStream, for example).  As I'm reading byte 1 of the stream
> at my client, the datanode is reading byte M where 1 < M <= N from disk.
> Thus, three activities tend to happen concurrently for the most part
> (disregarding the beginning and end of the file): 1) processing at the
> client; 2) streaming over the network from datanode to client; and 3)
> reading data from disk at the datanode.  The proportion of time these
> three activities overlap tends towards 100% as N -> infinity.
>
> Now suppose I read a batch of R records from HBase (where R = whatever
> scanner caching happens to be).  As I understand it, I issue my call to
> ResultScanner.next(), and this causes the RegionServer to block as if on a
> page fault while it loads enough HFile blocks from disk to cover the R
> records I (implicitly) requested.  After the blocks are loaded into the
> block cache on the RS, the RS returns R records to me over the network.
> Then I process the R records locally.  When they are exhausted, this cycle
> repeats.  The notable upshot is that while the RS is faulting HFile blocks
> into the cache, my client is blocked.  Furthermore, while my client is
> processing records, the RS is idle with respect to work on behalf of my
> client.
>
> That last point is really the killer, if I'm correct in my assumptions.
> It means that Scanner caching and larger block sizes work only to amortize
> the fixed overhead of disk IOs and RPCs -- they do nothing to keep the IO
> subsystems saturated during sequential reads.  What *should* happen is
> that the RS should treat the Scanner caching value (R above) as a hint
> that it should always have ready records r + 1 to r + R when I'm reading
> record r, at least up to the region boundary.  The RS should be preparing
> eagerly for the next call to ResultScanner.next(), which I suspect it's
> currently not doing.
>
> Another way to state this would be to say that the client should tell the
> RS to prepare the next batch of records soon enough that they can start to
> arrive at the client just as the client is finishing the current batch.
> As is, I suspect it doesn't request more from the RS until the local batch
> is exhausted.
>
> As I cautioned before, this is based on assumptions about how the
> internals work, so please correct me if I'm wrong.  Also, I'm way behind
> on the mailing list, so I probably won't see any responses unless CC'd
> directly.
>
> Sandy
>
> On 5/10/13 8:46 AM, "Bryan Keller" <br...@gmail.com> wrote:
>
> >FYI, I ran tests with compression on and off.
> >
> >With a plain HDFS sequence file and compression off, I am getting very
> >good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> >compression on with a sequence file, I/O speed is about 3x slower.
> >However the file size is 3x smaller so it takes about the same time to
> >scan.
> >
> >With HBase, the results are equivalent (just much slower than a sequence
> >file). Scanning a compressed table is about 3x slower I/O than an
> >uncompressed table, but the table is 3x smaller, so the time to scan is
> >about the same. Scanning an HBase table takes about 3x as long as
> >scanning the sequence file export of the table, either compressed or
> >uncompressed. The sequence file export file size ends up being just
> >barely larger than the table, either compressed or uncompressed
> >
> >So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> >the time to scan is about the same. Adding in HBase slows things down
> >another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> >file vs scanning a compressed table.
> >
> >
> >On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
> >
> >> Thanks for the offer Lars! I haven't made much progress speeding things
> >>up.
> >>
> >> I finally put together a test program that populates a table that is
> >>similar to my production dataset. I have a readme that should describe
> >>things, hopefully enough to make it useable. There is code to populate a
> >>test table, code to scan the table, and code to scan sequence files from
> >>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >>
> >> You can find the code here:
> >>
> >> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >>
> >>
> >> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
> >>
> >>> The blockbuffers are not reused, but that by itself should not be a
> >>>problem as they are all the same size (at least I have never identified
> >>>that as one in my profiling sessions).
> >>>
> >>> My offer still stands to do some profiling myself if there is an easy
> >>>way to generate data of similar shape.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Bryan Keller <br...@gmail.com>
> >>> To: user@hbase.apache.org
> >>> Sent: Friday, May 3, 2013 3:44 AM
> >>> Subject: Re: Poor HBase map-reduce scan performance
> >>>
> >>>
> >>> Actually I'm not too confident in my results re block size, they may
> >>>have been related to major compaction. I'm going to rerun before
> >>>drawing any conclusions.
> >>>
> >>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
> >>>
> >>>> I finally made some progress. I tried a very large HBase block size
> >>>>(16mb), and it significantly improved scan performance. I went from
> >>>>45-50 min to 24 min. Not great but much better. Before I had it set to
> >>>>128k. Scanning an equivalent sequence file takes 10 min. My random
> >>>>read performance will probably suffer with such a large block size
> >>>>(theoretically), so I probably can't keep it this big. I care about
> >>>>random read performance too. I've read having a block size this big is
> >>>>not recommended, is that correct?
> >>>>
> >>>> I haven't dug too deeply into the code, are the block buffers reused
> >>>>or is each new block read a new allocation? Perhaps a buffer pool
> >>>>could help here if there isn't one already. When doing a scan, HBase
> >>>>could reuse previously allocated block buffers instead of allocating a
> >>>>new one for each block. Then block size shouldn't affect scan
> >>>>performance much.
> >>>>
> >>>> I'm not using a block encoder. Also, I'm still sifting through the
> >>>>profiler results, I'll see if I can make more sense of it and run some
> >>>>more experiments.
> >>>>
> >>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
> >>>>
> >>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
> >>>>>changed that much from 0.94.4)
> >>>>>
> >>>>>
> >>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
> >>>>>so, try without. They currently need to reallocate a ByteBuffer for
> >>>>>each single KV.
> >>>>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably
> >>>>>have not enabled encoding, just checking).
> >>>>>
> >>>>>
> >>>>> And do you have a stack trace for the ByteBuffer.allocate(). That is
> >>>>>a strange one since it never came up in my profiling (unless you
> >>>>>enabled block encoding).
> >>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
> >>>>>have to drill in to find the allocate()).
> >>>>>
> >>>>>
> >>>>> During normal scanning (again, without encoding) there should be no
> >>>>>allocation happening except for blocks read from disk (and they
> >>>>>should all be the same size, thus allocation should be cheap).
> >>>>>
> >>>>> -- Lars
> >>>>>
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>> From: Bryan Keller <br...@gmail.com>
> >>>>> To: user@hbase.apache.org
> >>>>> Sent: Thursday, May 2, 2013 10:54 AM
> >>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>
> >>>>>
> >>>>> I ran one of my regionservers through VisualVM. It looks like the
> >>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> >>>>>ByteBuffer.allocate(). It appears at first glance that memory
> >>>>>allocations may be an issue. Decompression was next below that but
> >>>>>less of an issue it seems.
> >>>>>
> >>>>> Would changing the block size, either HDFS or HBase, help here?
> >>>>>
> >>>>> Also, if anyone has tips on how else to profile, that would be
> >>>>>appreciated. VisualVM can produce a lot of noise that is hard to sift
> >>>>>through.
> >>>>>
> >>>>>
> >>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
> >>>>>
> >>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >>>>>>
> >>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
> >>>>>>
> >>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
> >>>>>>>0.94.7.
> >>>>>>> I would be very curious to see profiling data.
> >>>>>>>
> >>>>>>> -- Lars
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>> From: Bryan Keller <br...@gmail.com>
> >>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>>>>>> Cc:
> >>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>
> >>>>>>> I tried running my test with 0.94.4, unfortunately performance was
> >>>>>>>about the same. I'm planning on profiling the regionserver and
> >>>>>>>trying some other things tonight and tomorrow and will report back.
> >>>>>>>
> >>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
> >>>>>>>>patch that would save me some time.
> >>>>>>>>
> >>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
> >>>>>>>>amount of bytes copied around in RAM during scanning, especially
> >>>>>>>>if you have wide rows and/or large key portions. That in turns
> >>>>>>>>makes scans scale better across cores, since RAM is shared
> >>>>>>>>resource between cores (much like disk).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It's not hard to build the latest HBase against Cloudera's
> >>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
> >>>>>>>>do that.
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>>  From: Bryan Keller <br...@gmail.com>
> >>>>>>>> To: user@hbase.apache.org
> >>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The table has hashed keys so rows are evenly distributed amongst
> >>>>>>>>the regionservers, and load on each regionserver is pretty much
> >>>>>>>>the same. I also have per-table balancing turned on. I get mostly
> >>>>>>>>data local mappers with only a few rack local (maybe 10 of the 250
> >>>>>>>>mappers).
> >>>>>>>>
> >>>>>>>> Currently the table is a wide table schema, with lists of data
> >>>>>>>>structures stored as columns with column prefixes grouping the
> >>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
> >>>>>>>>2_address, 2_city). I was thinking of moving those data structures
> >>>>>>>>to protobuf which would cut down on the number of columns. The
> >>>>>>>>downside is I can't filter on one value with that, but it is a
> >>>>>>>>tradeoff I would make for performance. I was also considering
> >>>>>>>>restructuring the table into a tall table.
> >>>>>>>>
> >>>>>>>> Something interesting is that my old regionserver machines had
> >>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was about
> >>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
> >>>>>>>>neither network nor disk I/O appear to be the bottleneck. The CPU
> >>>>>>>>is rather high for the regionserver so it seems like the best
> >>>>>>>>candidate to investigate. I will try profiling it tomorrow and
> >>>>>>>>will report back. I may revisit compression on vs off since that
> >>>>>>>>is adding load to the CPU.
> >>>>>>>>
> >>>>>>>> I'll also come up with a sample program that generates data
> >>>>>>>>similar to my table.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
> >>>>>>>>wrote:
> >>>>>>>>
> >>>>>>>>> Your average row is 35k so scanner caching would not make a huge
> >>>>>>>>>difference, although I would have expected some improvements by
> >>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
> >>>>>>>>>
> >>>>>>>>> I assume your table is split sufficiently to touch all
> >>>>>>>>>RegionServer... Do you see the same load/IO on all region servers?
> >>>>>>>>>
> >>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >>>>>>>>> I blogged about some of these changes here:
> >>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >>>>>>>>>
> >>>>>>>>> In your case - since you have many columns, each of which carry
> >>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
> >>>>>>>>>
> >>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
> >>>>>>>>>How could it not be?
> >>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
> >>>>>>>>>disbaled in both HBase and HDFS.
> >>>>>>>>>
> >>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> >>>>>>>>>Purtell is listening, I think he did some tests with HBase on
> >>>>>>>>>SSDs.
> >>>>>>>>> With rotating media you typically see an improvement with
> >>>>>>>>>compression. With SSDs the added CPU needed for decompression
> >>>>>>>>>might outweigh the benefits.
> >>>>>>>>>
> >>>>>>>>> At the risk of starting a larger discussion here, I would posit
> >>>>>>>>>that HBase's LSM based design, which trades random IO with
> >>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
> >>>>>>>>>
> >>>>>>>>> If you can, it would be nice to run a profiler against one of
> >>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
> >>>>>>>>>see where it is bottlenecked.
> >>>>>>>>> (And if you send me a sample program to generate some data - not
> >>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
> >>>>>>>>>next days as my day job permits, but I do not have any machines
> >>>>>>>>>with SSDs).
> >>>>>>>>>
> >>>>>>>>> -- Lars
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ________________________________
> >>>>>>>>> From: Bryan Keller <br...@gmail.com>
> >>>>>>>>> To: user@hbase.apache.org
> >>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yes, I have tried various settings for setCaching() and I have
> >>>>>>>>>setCacheBlocks(false)
> >>>>>>>>>
> >>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >>>>>>>>>>
> >>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
> >>>>>>>>>>will
> >>>>>>>>>> be bad for MapReduce jobs
> >>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>>>>>>>>>
> >>>>>>>>>> I guess you have used the above setting.
> >>>>>>>>>>
> >>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
> >>>>>>>>>>to, say
> >>>>>>>>>> 0.94.7 which was recently released ?
> >>>>>>>>>>
> >>>>>>>>>> Cheers
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> >>>>>>>
> >>
> >
>
>

Re: Poor HBase map-reduce scan performance

Posted by Sandy Pratt <pr...@adobe.com>.
I found this thread on search-hadoop.com just now because I've been
wrestling with the same issue for a while and have as yet been unable to
solve it.  However, I think I have an idea of the problem.  My theory is
based on assumptions about what's going on in HBase and HDFS internally,
so please correct me if I'm wrong.

Briefly, I think the issue is that sequential reads from HDFS are
pipelined, whereas sequential reads from HBase are not.  Therefore,
sequential reads from HDFS tend to keep the IO subsystem saturated, while
sequential reads from HBase allow it to idle for a relatively large
proportion of time.

To make this more concrete, suppose that I'm reading N bytes of data from
a file in HDFS.  I issue the calls to open the file and begin to read
(from an InputStream, for example).  As I'm reading byte 1 of the stream
at my client, the datanode is reading byte M where 1 < M <= N from disk.
Thus, three activities tend to happen concurrently for the most part
(disregarding the beginning and end of the file): 1) processing at the
client; 2) streaming over the network from datanode to client; and 3)
reading data from disk at the datanode.  The proportion of time these
three activities overlap tends towards 100% as N -> infinity.

Now suppose I read a batch of R records from HBase (where R = whatever
scanner caching happens to be).  As I understand it, I issue my call to
ResultScanner.next(), and this causes the RegionServer to block as if on a
page fault while it loads enough HFile blocks from disk to cover the R
records I (implicitly) requested.  After the blocks are loaded into the
block cache on the RS, the RS returns R records to me over the network.
Then I process the R records locally.  When they are exhausted, this cycle
repeats.  The notable upshot is that while the RS is faulting HFile blocks
into the cache, my client is blocked.  Furthermore, while my client is
processing records, the RS is idle with respect to work on behalf of my
client.

That last point is really the killer, if I'm correct in my assumptions.
It means that Scanner caching and larger block sizes work only to amortize
the fixed overhead of disk IOs and RPCs -- they do nothing to keep the IO
subsystems saturated during sequential reads.  What *should* happen is
that the RS should treat the Scanner caching value (R above) as a hint
that it should always have ready records r + 1 to r + R when I'm reading
record r, at least up to the region boundary.  The RS should be preparing
eagerly for the next call to ResultScanner.next(), which I suspect it's
currently not doing.

Another way to state this would be to say that the client should tell the
RS to prepare the next batch of records soon enough that they can start to
arrive at the client just as the client is finishing the current batch.
As is, I suspect it doesn't request more from the RS until the local batch
is exhausted.
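
To make that concrete, here's a rough, untested sketch of a client-side
workaround: wrap the ResultScanner so the next batch is fetched on a
background thread while the caller processes the current one. The class and
names are made up for illustration, and it only hides the stall from the
client -- it doesn't make the RS itself read ahead.

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;

public class PrefetchingScanner {
  private final ResultScanner scanner;
  private final int batchSize;
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private Future<Result[]> pending;

  public PrefetchingScanner(ResultScanner scanner, int batchSize) {
    this.scanner = scanner;
    this.batchSize = batchSize;
    this.pending = submitFetch();          // kick off the first fetch right away
  }

  private Future<Result[]> submitFetch() {
    return pool.submit(new Callable<Result[]>() {
      public Result[] call() throws IOException {
        return scanner.next(batchSize);    // blocks on the background thread, not the caller
      }
    });
  }

  // Returns the batch fetched in the background and immediately starts the next
  // fetch, so RS/disk work overlaps with the caller's processing of this batch.
  public Result[] nextBatch() throws Exception {
    Result[] batch = pending.get();
    pending = submitFetch();
    return batch;                          // an empty array means the scan is done
  }

  public void close() {
    pool.shutdownNow();
    scanner.close();
  }
}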

As I cautioned before, this is based on assumptions about how the
internals work, so please correct me if I'm wrong.  Also, I'm way behind
on the mailing list, so I probably won't see any responses unless CC'd
directly.

Sandy

On 5/10/13 8:46 AM, "Bryan Keller" <br...@gmail.com> wrote:

>FYI, I ran tests with compression on and off.
>
>With a plain HDFS sequence file and compression off, I am getting very
>good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>compression on with a sequence file, I/O speed is about 3x slower.
>However the file size is 3x smaller so it takes about the same time to
>scan.
>
>With HBase, the results are equivalent (just much slower than a sequence
>file). Scanning a compressed table is about 3x slower I/O than an
>uncompressed table, but the table is 3x smaller, so the time to scan is
>about the same. Scanning an HBase table takes about 3x as long as
>scanning the sequence file export of the table, either compressed or
>uncompressed. The sequence file export file size ends up being just
>barely larger than the table, either compressed or uncompressed
>
>So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>the time to scan is about the same. Adding in HBase slows things down
>another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
>file vs scanning a compressed table.
>
>
>On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
>
>> Thanks for the offer Lars! I haven't made much progress speeding things
>>up.
>>
>> I finally put together a test program that populates a table that is
>>similar to my production dataset. I have a readme that should describe
>>things, hopefully enough to make it useable. There is code to populate a
>>test table, code to scan the table, and code to scan sequence files from
>>an export (to compare HBase w/ raw HDFS). I use a gradle build script.
>>
>> You can find the code here:
>>
>> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>>
>>
>> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
>>
>>> The blockbuffers are not reused, but that by itself should not be a
>>>problem as they are all the same size (at least I have never identified
>>>that as one in my profiling sessions).
>>>
>>> My offer still stands to do some profiling myself if there is an easy
>>>way to generate data of similar shape.
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>> From: Bryan Keller <br...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Friday, May 3, 2013 3:44 AM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>>
>>>
>>> Actually I'm not too confident in my results re block size, they may
>>>have been related to major compaction. I'm going to rerun before
>>>drawing any conclusions.
>>>
>>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
>>>
>>>> I finally made some progress. I tried a very large HBase block size
>>>>(16mb), and it significantly improved scan performance. I went from
>>>>45-50 min to 24 min. Not great but much better. Before I had it set to
>>>>128k. Scanning an equivalent sequence file takes 10 min. My random
>>>>read performance will probably suffer with such a large block size
>>>>(theoretically), so I probably can't keep it this big. I care about
>>>>random read performance too. I've read having a block size this big is
>>>>not recommended, is that correct?
>>>>
>>>> I haven't dug too deeply into the code, are the block buffers reused
>>>>or is each new block read a new allocation? Perhaps a buffer pool
>>>>could help here if there isn't one already. When doing a scan, HBase
>>>>could reuse previously allocated block buffers instead of allocating a
>>>>new one for each block. Then block size shouldn't affect scan
>>>>performance much.
>>>>
>>>> I'm not using a block encoder. Also, I'm still sifting through the
>>>>profiler results, I'll see if I can make more sense of it and run some
>>>>more experiments.
>>>>
>>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
>>>>
>>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>>>>>changed that much from 0.94.4)
>>>>>
>>>>>
>>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
>>>>>so, try without. They currently need to reallocate a ByteBuffer for
>>>>>each single KV.
>>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
>>>>>have not enabled encoding, just checking).
>>>>>
>>>>>
>>>>> And do you have a stack trace for the ByteBuffer.allocate(). That is
>>>>>a strange one since it never came up in my profiling (unless you
>>>>>enabled block encoding).
>>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
>>>>>have to drill in to find the allocate()).
>>>>>
>>>>>
>>>>> During normal scanning (again, without encoding) there should be no
>>>>>allocation happening except for blocks read from disk (and they
>>>>>should all be the same size, thus allocation should be cheap).
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Bryan Keller <br...@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>
>>>>>
>>>>> I ran one of my regionservers through VisualVM. It looks like the
>>>>>top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>>>>>ByteBuffer.allocate(). It appears at first glance that memory
>>>>>allocations may be an issue. Decompression was next below that but
>>>>>less of an issue it seems.
>>>>>
>>>>> Would changing the block size, either HDFS or HBase, help here?
>>>>>
>>>>> Also, if anyone has tips on how else to profile, that would be
>>>>>appreciated. VisualVM can produce a lot of noise that is hard to sift
>>>>>through.
>>>>>
>>>>>
>>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>>>>>
>>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>>>
>>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>>>>>
>>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
>>>>>>>0.94.7.
>>>>>>> I would be very curious to see profiling data.
>>>>>>>
>>>>>>> -- Lars
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>>>> Cc:
>>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>
>>>>>>> I tried running my test with 0.94.4, unfortunately performance was
>>>>>>>about the same. I'm planning on profiling the regionserver and
>>>>>>>trying some other things tonight and tomorrow and will report back.
>>>>>>>
>>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
>>>>>>>>patch that would save me some time.
>>>>>>>>
>>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>>> If you can, try 0.94.4+; it should significantly reduce the
>>>>>>>>amount of bytes copied around in RAM during scanning, especially
>>>>>>>>if you have wide rows and/or large key portions. That in turn
>>>>>>>>makes scans scale better across cores, since RAM is a shared
>>>>>>>>resource between cores (much like disk).
>>>>>>>>
>>>>>>>>
>>>>>>>> It's not hard to build the latest HBase against Cloudera's
>>>>>>>>version of Hadoop. I can send along a simple patch to pom.xml to
>>>>>>>>do that.
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>>  From: Bryan Keller <br...@gmail.com>
>>>>>>>> To: user@hbase.apache.org
>>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>
>>>>>>>>
>>>>>>>> The table has hashed keys so rows are evenly distributed amongst
>>>>>>>>the regionservers, and load on each regionserver is pretty much
>>>>>>>>the same. I also have per-table balancing turned on. I get mostly
>>>>>>>>data local mappers with only a few rack local (maybe 10 of the 250
>>>>>>>>mappers).
>>>>>>>>
>>>>>>>> Currently the table is a wide table schema, with lists of data
>>>>>>>>structures stored as columns with column prefixes grouping the
>>>>>>>>data structures (e.g. 1_name, 1_address, 1_city, 2_name,
>>>>>>>>2_address, 2_city). I was thinking of moving those data structures
>>>>>>>>to protobuf which would cut down on the number of columns. The
>>>>>>>>downside is I can't filter on one value with that, but it is a
>>>>>>>>tradeoff I would make for performance. I was also considering
>>>>>>>>restructuring the table into a tall table.
>>>>>>>>
>>>>>>>> Something interesting is that my old regionserver machines had
>>>>>>>>five 15k SCSI drives instead of 2 SSDs, and performance was about
>>>>>>>>the same. Also, my old network was 1gbit, now it is 10gbit. So
>>>>>>>>neither network nor disk I/O appear to be the bottleneck. The CPU
>>>>>>>>is rather high for the regionserver so it seems like the best
>>>>>>>>candidate to investigate. I will try profiling it tomorrow and
>>>>>>>>will report back. I may revisit compression on vs off since that
>>>>>>>>is adding load to the CPU.
>>>>>>>>
>>>>>>>> I'll also come up with a sample program that generates data
>>>>>>>>similar to my table.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>> Your average row is 35k so scanner caching would not make a huge
>>>>>>>>>difference, although I would have expected some improvements by
>>>>>>>>>setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>>>
>>>>>>>>> I assume your table is split sufficiently to touch all
>>>>>>>>>RegionServer... Do you see the same load/IO on all region servers?
>>>>>>>>>
>>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>>> I blogged about some of these changes here:
>>>>>>>>>http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>>>
>>>>>>>>> In your case - since you have many columns, each of which carry
>>>>>>>>>the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>>>
>>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
>>>>>>>>>How could it not be?
>>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
>>>>>>>>>disabled in both HBase and HDFS.
>>>>>>>>>
>>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
>>>>>>>>>Purtell is listening, I think he did some tests with HBase on
>>>>>>>>>SSDs.
>>>>>>>>> With rotating media you typically see an improvement with
>>>>>>>>>compression. With SSDs the added CPU needed for decompression
>>>>>>>>>might outweigh the benefits.
>>>>>>>>>
>>>>>>>>> At the risk of starting a larger discussion here, I would posit
>>>>>>>>>that HBase's LSM based design, which trades random IO with
>>>>>>>>>sequential IO, might be a bit more questionable on SSDs.
>>>>>>>>>
>>>>>>>>> If you can, it would be nice to run a profiler against one of
>>>>>>>>>the RegionServers (or maybe do it with the single RS setup) and
>>>>>>>>>see where it is bottlenecked.
>>>>>>>>> (And if you send me a sample program to generate some data - not
>>>>>>>>>700g, though :) - I'll try to do a bit of profiling during the
>>>>>>>>>next days as my day job permits, but I do not have any machines
>>>>>>>>>with SSDs).
>>>>>>>>>
>>>>>>>>> -- Lars
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>>>>>>>>>setCacheBlocks(false)
>>>>>>>>>
>>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>>>
>>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
>>>>>>>>>>will
>>>>>>>>>> be bad for MapReduce jobs
>>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>>>>
>>>>>>>>>> I guess you have used the above setting.
>>>>>>>>>>
>>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>>>>>>>>>>to, say
>>>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>>>
>>
>


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Thanks Enis, I'll see if I can backport this patch - it is exactly what I was going to try. This should solve my scan performance problems if I can get it to work.

On May 29, 2013, at 1:29 PM, Enis Söztutar <en...@hortonworks.com> wrote:

> Hi,
> 
> Regarding running raw scans on top of HFiles, you can try a version of the
> patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which
> enables exactly this. However, the patch is for trunk.
> 
> In that, we open one region from snapshot files in each record reader, and
> run a scan through it using an internal region scanner. Since this bypasses
> the client + rpc + server daemon layers, it should be able to give optimum
> scan performance.
> 
> There is also a tool called HFilePerformanceBenchmark that intends to
> measure raw performance for HFiles. I've had to do a lot of changes to make
> it workable, but it might be worth taking a look to see whether there is
> any perf difference between scanning a sequence file from hdfs vs scanning
> an hfile.
> 
> Enis
> 
> 
> On Fri, May 24, 2013 at 10:50 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Sorry. Haven't gotten to this, yet.
>> 
>> Scanning in HBase being about 3x slower than straight HDFS is in the right
>> ballpark, though. It has to do a bit more work.
>> 
>> Generally, HBase is great at honing in to a subset (some 10-100m rows) of
>> the data. Raw scan performance is not (yet) a strength of HBase.
>> 
>> So with HDFS you get to 75% of the theoretical maximum read throughput;
>> hence with HBase you get to 25% of the theoretical cluster-wide maximum disk
>> throughput?
>> 
>> 
>> -- Lars
>> 
>> 
>> 
>> ----- Original Message -----
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Friday, May 10, 2013 8:46 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> FYI, I ran tests with compression on and off.
>> 
>> With a plain HDFS sequence file and compression off, I am getting very
>> good I/O numbers, roughly 75% of theoretical max for reads. With snappy
>> compression on with a sequence file, I/O speed is about 3x slower. However
>> the file size is 3x smaller so it takes about the same time to scan.
>> 
>> With HBase, the results are equivalent (just much slower than a sequence
>> file). Scanning a compressed table is about 3x slower I/O than an
>> uncompressed table, but the table is 3x smaller, so the time to scan is
>> about the same. Scanning an HBase table takes about 3x as long as scanning
>> the sequence file export of the table, either compressed or uncompressed.
>> The sequence file export file size ends up being just barely larger than
>> the table, either compressed or uncompressed
>> 
>> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
>> the time to scan is about the same. Adding in HBase slows things down
>> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
>> file vs scanning a compressed table.
>> 
>> 
>> On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> Thanks for the offer Lars! I haven't made much progress speeding things
>> up.
>>> 
>>> I finally put together a test program that populates a table that is
>> similar to my production dataset. I have a readme that should describe
>> things, hopefully enough to make it useable. There is code to populate a
>> test table, code to scan the table, and code to scan sequence files from an
>> export (to compare HBase w/ raw HDFS). I use a gradle build script.
>>> 
>>> You can find the code here:
>>> 
>>> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
>>> 
>>> 
>>> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> The blockbuffers are not reused, but that by itself should not be a
>> problem as they are all the same size (at least I have never identified
>> that as one in my profiling sessions).
>>>> 
>>>> My offer still stands to do some profiling myself if there is an easy
>> way to generate data of similar shape.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Friday, May 3, 2013 3:44 AM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> Actually I'm not too confident in my results re block size, they may
>> have been related to major compaction. I'm going to rerun before drawing
>> any conclusions.
>>>> 
>>>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
>>>> 
>>>>> I finally made some progress. I tried a very large HBase block size
>> (16mb), and it significantly improved scan performance. I went from 45-50
>> min to 24 min. Not great but much better. Before I had it set to 128k.
>> Scanning an equivalent sequence file takes 10 min. My random read
>> performance will probably suffer with such a large block size
>> (theoretically), so I probably can't keep it this big. I care about random
>> read performance too. I've read having a block size this big is not
>> recommended, is that correct?
>>>>> 
>>>>> I haven't dug too deeply into the code, are the block buffers reused
>> or is each new block read a new allocation? Perhaps a buffer pool could
>> help here if there isn't one already. When doing a scan, HBase could reuse
>> previously allocated block buffers instead of allocating a new one for each
>> block. Then block size shouldn't affect scan performance much.
>>>>> 
>>>>> I'm not using a block encoder. Also, I'm still sifting through the
>> profiler results, I'll see if I can make more sense of it and run some more
>> experiments.
>>>>> 
>>>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
>>>>> 
>>>>>> Interesting. If you can try 0.94.7 (but it'll probably not have
>> changed that much from 0.94.4)
>>>>>> 
>>>>>> 
>>>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
>> so, try without. They currently need to reallocate a ByteBuffer for each
>> single KV.
>>>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
>> have not enabled encoding, just checking).
>>>>>> 
>>>>>> 
>>>>>> And do you have a stack trace for the ByteBuffer.allocate(). That is
>> a strange one since it never came up in my profiling (unless you enabled
>> block encoding).
>>>>>> (You can get traces from VisualVM by creating a snapshot, but you'd
>> have to drill in to find the allocate()).
>>>>>> 
>>>>>> 
>>>>>> During normal scanning (again, without encoding) there should be no
>> allocation happening except for blocks read from disk (and they should all
>> be the same size, thus allocation should be cheap).
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> 
>>>>>> I ran one of my regionservers through VisualVM. It looks like the top
>> hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
>> ByteBuffer.allocate(). It appears at first glance that memory allocations
>> may be an issue. Decompression was next below that but less of an issue it
>> seems.
>>>>>> 
>>>>>> Would changing the block size, either HDFS or HBase, help here?
>>>>>> 
>>>>>> Also, if anyone has tips on how else to profile, that would be
>> appreciated. VisualVM can produce a lot of noise that is hard to sift
>> through.
>>>>>> 
>>>>>> 
>>>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>>>>>> 
>>>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>>>> 
>>>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>>>>>> 
>>>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
>> 0.94.7.
>>>>>>>> I would be very curious to see profiling data.
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ----- Original Message -----
>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>>>>> Cc:
>>>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>> 
>>>>>>>> I tried running my test with 0.94.4, unfortunately performance was
>> about the same. I'm planning on profiling the regionserver and trying some
>> other things tonight and tomorrow and will report back.
>>>>>>>> 
>>>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Yes I would like to try this, if you can point me to the pom.xml
>> patch that would save me some time.
>>>>>>>>> 
>>>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>>>> If you can, try 0.94.4+; it should significantly reduce the amount
>> of bytes copied around in RAM during scanning, especially if you have wide
>> rows and/or large key portions. That in turn makes scans scale better
>> across cores, since RAM is a shared resource between cores (much like disk).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It's not hard to build the latest HBase against Cloudera's version
>> of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>>>>>> 
>>>>>>>>> -- Lars
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ________________________________
>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The table has hashed keys so rows are evenly distributed amongst
>> the regionservers, and load on each regionserver is pretty much the same. I
>> also have per-table balancing turned on. I get mostly data local mappers
>> with only a few rack local (maybe 10 of the 250 mappers).
>>>>>>>>> 
>>>>>>>>> Currently the table is a wide table schema, with lists of data
>> structures stored as columns with column prefixes grouping the data
>> structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I
>> was thinking of moving those data structures to protobuf which would cut
>> down on the number of columns. The downside is I can't filter on one value
>> with that, but it is a tradeoff I would make for performance. I was also
>> considering restructuring the table into a tall table.
>>>>>>>>> 
>>>>>>>>> Something interesting is that my old regionserver machines had
>> five 15k SCSI drives instead of 2 SSDs, and performance was about the same.
>> Also, my old network was 1gbit, now it is 10gbit. So neither network nor
>> disk I/O appear to be the bottleneck. The CPU is rather high for the
>> regionserver so it seems like the best candidate to investigate. I will try
>> profiling it tomorrow and will report back. I may revisit compression on vs
>> off since that is adding load to the CPU.
>>>>>>>>> 
>>>>>>>>> I'll also come up with a sample program that generates data
>> similar to my table.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
>> wrote:
>>>>>>>>> 
>>>>>>>>>> Your average row is 35k so scanner caching would not make a huge
>> difference, although I would have expected some improvements by setting it
>> to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>>>> 
>>>>>>>>>> I assume your table is split sufficiently to touch all
>> RegionServer... Do you see the same load/IO on all region servers?
>>>>>>>>>> 
>>>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>>>> I blogged about some of these changes here:
>> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>>>> 
>>>>>>>>>> In your case - since you have many columns, each of which carry
>> the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>>>> 
>>>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
>> How could it not be?
>>>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
>> disabled in both HBase and HDFS.
>>>>>>>>>> 
>>>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
>> Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>>>>>> With rotating media you typically see an improvement with
>> compression. With SSDs the added CPU needed for decompression might
>> outweigh the benefits.
>>>>>>>>>> 
>>>>>>>>>> At the risk of starting a larger discussion here, I would posit
>> that HBase's LSM based design, which trades random IO with sequential IO,
>> might be a bit more questionable on SSDs.
>>>>>>>>>> 
>>>>>>>>>> If you can, it would be nice to run a profiler against one of the
>> RegionServers (or maybe do it with the single RS setup) and see where it is
>> bottlenecked.
>>>>>>>>>> (And if you send me a sample program to generate some data - not
>> 700g, though :) - I'll try to do a bit of profiling during the next days as
>> my day job permits, but I do not have any machines with SSDs).
>>>>>>>>>> 
>>>>>>>>>> -- Lars
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Yes, I have tried various settings for setCaching() and I have
>> setCacheBlocks(false)
>>>>>>>>>> 
>>>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>>>> 
>>>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
>> will
>>>>>>>>>>> be bad for MapReduce jobs
>>>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>>>>> 
>>>>>>>>>>> I guess you have used the above setting.
>>>>>>>>>>> 
>>>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
>> to, say
>>>>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>>>> 
>>> 
>> 
>> 


Re: Poor HBase map-reduce scan performance

Posted by Enis Söztutar <en...@hortonworks.com>.
Hi,

Regarding running raw scans on top of HFiles, you can try a version of the
patch attached at https://issues.apache.org/jira/browse/HBASE-8369, which
enables exactly this. However, the patch is for trunk.

In that, we open one region from snapshot files in each record reader, and
run a scan through it using an internal region scanner. Since this bypasses
the client + rpc + server daemon layers, it should be able to give optimum
scan performance.
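
If it helps, here is a rough sketch of how the map-reduce side would be wired
up with that patch applied. This is a sketch only: the snapshot name, restore
directory and mapper are placeholders, and the class/method names come from
the trunk patch, so they may still change before it is committed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // No-op mapper, just to measure raw scan throughput over the snapshot files.
  static class NoOpMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
      // do nothing with the row
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "snapshot-scan");
    job.setJarByClass(SnapshotScanJob.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false);   // no RS block cache involved, but keep the scan lean

    // Take the snapshot first (e.g. from the shell: snapshot 'mytable', 'mytable_snap'),
    // then run the mappers directly over the snapshot's HFiles:
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "mytable_snap", scan, NoOpMapper.class,
        ImmutableBytesWritable.class, Result.class, job,
        true, new Path("/tmp/snapshot_restore"));   // scratch dir the input format restores into

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}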

There is also a tool called HFilePerformanceBenchmark that intends to
measure raw performance for HFiles. I've had to do a lot of changes to make
it workable, but it might be worth taking a look to see whether there is
any perf difference between scanning a sequence file from hdfs vs scanning
an hfile.

Enis


On Fri, May 24, 2013 at 10:50 PM, lars hofhansl <la...@apache.org> wrote:

> Sorry. Haven't gotten to this, yet.
>
> Scanning in HBase being about 3x slower than straight HDFS is in the right
> ballpark, though. It has to do a bit more work.
>
> Generally, HBase is great at honing in to a subset (some 10-100m rows) of
> the data. Raw scan performance is not (yet) a strength of HBase.
>
> So with HDFS you get to 75% of the theoretical maximum read throughput;
> hence with HBase you get to 25% of the theoretical cluster-wide maximum disk
> throughput?
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Friday, May 10, 2013 8:46 AM
> Subject: Re: Poor HBase map-reduce scan performance
>
> FYI, I ran tests with compression on and off.
>
> With a plain HDFS sequence file and compression off, I am getting very
> good I/O numbers, roughly 75% of theoretical max for reads. With snappy
> compression on with a sequence file, I/O speed is about 3x slower. However
> the file size is 3x smaller so it takes about the same time to scan.
>
> With HBase, the results are equivalent (just much slower than a sequence
> file). Scanning a compressed table is about 3x slower I/O than an
> uncompressed table, but the table is 3x smaller, so the time to scan is
> about the same. Scanning an HBase table takes about 3x as long as scanning
> the sequence file export of the table, either compressed or uncompressed.
> The sequence file export file size ends up being just barely larger than
> the table, either compressed or uncompressed
>
> So in sum, compression slows down I/O 3x, but the file is 3x smaller so
> the time to scan is about the same. Adding in HBase slows things down
> another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence
> file vs scanning a compressed table.
>
>
> On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:
>
> > Thanks for the offer Lars! I haven't made much progress speeding things
> up.
> >
> > I finally put together a test program that populates a table that is
> similar to my production dataset. I have a readme that should describe
> things, hopefully enough to make it useable. There is code to populate a
> test table, code to scan the table, and code to scan sequence files from an
> export (to compare HBase w/ raw HDFS). I use a gradle build script.
> >
> > You can find the code here:
> >
> > https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> >
> >
> > On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> The blockbuffers are not reused, but that by itself should not be a
> problem as they are all the same size (at least I have never identified
> that as one in my profiling sessions).
> >>
> >> My offer still stands to do some profiling myself if there is an easy
> way to generate data of similar shape.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >> From: Bryan Keller <br...@gmail.com>
> >> To: user@hbase.apache.org
> >> Sent: Friday, May 3, 2013 3:44 AM
> >> Subject: Re: Poor HBase map-reduce scan performance
> >>
> >>
> >> Actually I'm not too confident in my results re block size, they may
> have been related to major compaction. I'm going to rerun before drawing
> any conclusions.
> >>
> >> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
> >>
> >>> I finally made some progress. I tried a very large HBase block size
> (16mb), and it significantly improved scan performance. I went from 45-50
> min to 24 min. Not great but much better. Before I had it set to 128k.
> Scanning an equivalent sequence file takes 10 min. My random read
> performance will probably suffer with such a large block size
> (theoretically), so I probably can't keep it this big. I care about random
> read performance too. I've read having a block size this big is not
> recommended, is that correct?
> >>>
> >>> I haven't dug too deeply into the code, are the block buffers reused
> or is each new block read a new allocation? Perhaps a buffer pool could
> help here if there isn't one already. When doing a scan, HBase could reuse
> previously allocated block buffers instead of allocating a new one for each
> block. Then block size shouldn't affect scan performance much.
> >>>
> >>> I'm not using a block encoder. Also, I'm still sifting through the
> profiler results, I'll see if I can make more sense of it and run some more
> experiments.
> >>>
> >>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
> >>>
> >>>> Interesting. If you can try 0.94.7 (but it'll probably not have
> changed that much from 0.94.4)
> >>>>
> >>>>
> >>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If
> so, try without. They currently need to reallocate a ByteBuffer for each
> single KV.
> >>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably
> have not enabled encoding, just checking).
> >>>>
> >>>>
> >>>> And do you have a stack trace for the ByteBuffer.allocate(). That is
> a strange one since it never came up in my profiling (unless you enabled
> block encoding).
> >>>> (You can get traces from VisualVM by creating a snapshot, but you'd
> have to drill in to find the allocate()).
> >>>>
> >>>>
> >>>> During normal scanning (again, without encoding) there should be no
> allocation happening except for blocks read from disk (and they should all
> be the same size, thus allocation should be cheap).
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Bryan Keller <br...@gmail.com>
> >>>> To: user@hbase.apache.org
> >>>> Sent: Thursday, May 2, 2013 10:54 AM
> >>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>
> >>>>
> >>>> I ran one of my regionservers through VisualVM. It looks like the top
> hot spots are HFileReaderV2$ScannerV2.getKeyValue() and
> ByteBuffer.allocate(). It appears at first glance that memory allocations
> may be an issue. Decompression was next below that but less of an issue it
> seems.
> >>>>
> >>>> Would changing the block size, either HDFS or HBase, help here?
> >>>>
> >>>> Also, if anyone has tips on how else to profile, that would be
> appreciated. VisualVM can produce a lot of noise that is hard to sift
> through.
> >>>>
> >>>>
> >>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
> >>>>
> >>>>> I used exactly 0.94.4, pulled from the tag in subversion.
> >>>>>
> >>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
> >>>>>
> >>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest
> 0.94.7.
> >>>>>> I would be very curious to see profiling data.
> >>>>>>
> >>>>>> -- Lars
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>> From: Bryan Keller <br...@gmail.com>
> >>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>>>>> Cc:
> >>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
> >>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>
> >>>>>> I tried running my test with 0.94.4, unfortunately performance was
> about the same. I'm planning on profiling the regionserver and trying some
> other things tonight and tomorrow and will report back.
> >>>>>>
> >>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Yes I would like to try this, if you can point me to the pom.xml
> patch that would save me some time.
> >>>>>>>
> >>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
> >>>>>>> If you can, try 0.94.4+; it should significantly reduce the amount
> of bytes copied around in RAM during scanning, especially if you have wide
> rows and/or large key portions. That in turn makes scans scale better
> across cores, since RAM is a shared resource between cores (much like disk).
> >>>>>>>
> >>>>>>>
> >>>>>>> It's not hard to build the latest HBase against Cloudera's version
> of Hadoop. I can send along a simple patch to pom.xml to do that.
> >>>>>>>
> >>>>>>> -- Lars
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ________________________________
> >>>>>>>  From: Bryan Keller <br...@gmail.com>
> >>>>>>> To: user@hbase.apache.org
> >>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
> >>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>
> >>>>>>>
> >>>>>>> The table has hashed keys so rows are evenly distributed amongst
> the regionservers, and load on each regionserver is pretty much the same. I
> also have per-table balancing turned on. I get mostly data local mappers
> with only a few rack local (maybe 10 of the 250 mappers).
> >>>>>>>
> >>>>>>> Currently the table is a wide table schema, with lists of data
> structures stored as columns with column prefixes grouping the data
> structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I
> was thinking of moving those data structures to protobuf which would cut
> down on the number of columns. The downside is I can't filter on one value
> with that, but it is a tradeoff I would make for performance. I was also
> considering restructuring the table into a tall table.
> >>>>>>>
> >>>>>>> Something interesting is that my old regionserver machines had
> five 15k SCSI drives instead of 2 SSDs, and performance was about the same.
> Also, my old network was 1gbit, now it is 10gbit. So neither network nor
> disk I/O appear to be the bottleneck. The CPU is rather high for the
> regionserver so it seems like the best candidate to investigate. I will try
> profiling it tomorrow and will report back. I may revisit compression on vs
> off since that is adding load to the CPU.
> >>>>>>>
> >>>>>>> I'll also come up with a sample program that generates data
> similar to my table.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org>
> wrote:
> >>>>>>>
> >>>>>>>> Your average row is 35k so scanner caching would not make a huge
> difference, although I would have expected some improvements by setting it
> to 10 or 50 since you have a wide 10ge pipe.
> >>>>>>>>
> >>>>>>>> I assume your table is split sufficiently to touch all
> RegionServer... Do you see the same load/IO on all region servers?
> >>>>>>>>
> >>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
> >>>>>>>> I blogged about some of these changes here:
> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >>>>>>>>
> >>>>>>>> In your case - since you have many columns, each of which carry
> the rowkey - you might benefit a lot from HBASE-7279.
> >>>>>>>>
> >>>>>>>> In the end HBase *is* slower than straight HDFS for full scans.
> How could it not be?
> >>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is
> disabled in both HBase and HDFS.
> >>>>>>>>
> >>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> Purtell is listening, I think he did some tests with HBase on SSDs.
> >>>>>>>> With rotating media you typically see an improvement with
> compression. With SSDs the added CPU needed for decompression might
> outweigh the benefits.
> >>>>>>>>
> >>>>>>>> At the risk of starting a larger discussion here, I would posit
> that HBase's LSM based design, which trades random IO with sequential IO,
> might be a bit more questionable on SSDs.
> >>>>>>>>
> >>>>>>>> If you can, it would be nice to run a profiler against one of the
> RegionServers (or maybe do it with the single RS setup) and see where it is
> bottlenecked.
> >>>>>>>> (And if you send me a sample program to generate some data - not
> 700g, though :) - I'll try to do a bit of profiling during the next days as
> my day job permits, but I do not have any machines with SSDs).
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: Bryan Keller <br...@gmail.com>
> >>>>>>>> To: user@hbase.apache.org
> >>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
> >>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Yes, I have tried various settings for setCaching() and I have
> setCacheBlocks(false)
> >>>>>>>>
> >>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
> >>>>>>>>>
> >>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which
> will
> >>>>>>>>> be bad for MapReduce jobs
> >>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>>>>>>>>
> >>>>>>>>> I guess you have used the above setting.
> >>>>>>>>>
> >>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading
> to, say
> >>>>>>>>> 0.94.7 which was recently released ?
> >>>>>>>>>
> >>>>>>>>> Cheers
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> >>>>>>
> >
>
>

Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
Sorry. Haven't gotten to this, yet.

Scanning in HBase being about 3x slower than straight HDFS is in the right ballpark, though. It has to do a bit more work.

Generally, HBase is great at honing in to a subset (some 10-100m rows) of the data. Raw scan performance is not (yet) a strength of HBase.

So with HDFS you get to 75% of the theoretical maximum read throughput; hence with HBase you get to 25% of the theoretical cluster-wide maximum disk throughput?


-- Lars



----- Original Message -----
From: Bryan Keller <br...@gmail.com>
To: user@hbase.apache.org
Cc: 
Sent: Friday, May 10, 2013 8:46 AM
Subject: Re: Poor HBase map-reduce scan performance

FYI, I ran tests with compression on and off.

With a plain HDFS sequence file and compression off, I am getting very good I/O numbers, roughly 75% of theoretical max for reads. With snappy compression on with a sequence file, I/O speed is about 3x slower. However the file size is 3x smaller so it takes about the same time to scan.

With HBase, the results are equivalent (just much slower than a sequence file). Scanning a compressed table is about 3x slower I/O than an uncompressed table, but the table is 3x smaller, so the time to scan is about the same. Scanning an HBase table takes about 3x as long as scanning the sequence file export of the table, either compressed or uncompressed. The sequence file export file size ends up being just barely larger than the table, either compressed or uncompressed

So in sum, compression slows down I/O 3x, but the file is 3x smaller so the time to scan is about the same. Adding in HBase slows things down another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence file vs scanning a compressed table.


On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:

> Thanks for the offer Lars! I haven't made much progress speeding things up.
> 
> I finally put together a test program that populates a table that is similar to my production dataset. I have a readme that should describe things, hopefully enough to make it useable. There is code to populate a test table, code to scan the table, and code to scan sequence files from an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> 
> You can find the code here:
> 
> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> 
> 
> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> The blockbuffers are not reused, but that by itself should not be a problem as they are all the same size (at least I have never identified that as one in my profiling sessions).
>> 
>> My offer still stands to do some profiling myself if there is an easy way to generate data of similar shape.
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org 
>> Sent: Friday, May 3, 2013 3:44 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> Actually I'm not too confident in my results re block size, they may have been related to major compaction. I'm going to rerun before drawing any conclusions.
>> 
>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read having a block size this big is not recommended, is that correct?
>>> 
>>> I haven't dug too deeply into the code, are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
>>> 
>>> I'm not using a block encoder. Also, I'm still sifting through the profiler results, I'll see if I can make more sense of it and run some more experiments.
>>> 
>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
>>>> 
>>>> 
>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>>>> (Since you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
>>>> 
>>>> 
>>>> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
>>>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
>>>> 
>>>> 
>>>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org 
>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
>>>> 
>>>> Would changing the block size, either HDFS or HBase, help here?
>>>> 
>>>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
>>>> 
>>>> 
>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>>>> 
>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>> 
>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>>>> 
>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>>>>> I would be very curious to see profiling data.
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>>> Cc: 
>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>>>>> 
>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>>>> 
>>>>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>>>>> 
>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turn makes scans scale better across cores, since RAM is a shared resource between cores (much like disk).
>>>>>>> 
>>>>>>> 
>>>>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>>  From: Bryan Keller <br...@gmail.com>
>>>>>>> To: user@hbase.apache.org
>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>> 
>>>>>>> 
>>>>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>>>>> 
>>>>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>>>>> 
>>>>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>>>>> 
>>>>>>> I'll also come up with a sample program that generates data similar to my table.
>>>>>>> 
>>>>>>> 
>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>>>>> 
>>>>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>> 
>>>>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>>>>> 
>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>> 
>>>>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>> 
>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disabled in both HBase and HDFS.
>>>>>>>> 
>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>>>>> 
>>>>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>>>>> 
>>>>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ________________________________
>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>> To: user@hbase.apache.org
>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>>>>> 
>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>> 
>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>>>>> be bad for MapReduce jobs
>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>>> 
>>>>>>>>> I guess you have used the above setting.
>>>>>>>>> 
>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>> 
> 


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
FYI, I ran tests with compression on and off.

With a plain HDFS sequence file and compression off, I am getting very good I/O numbers, roughly 75% of theoretical max for reads. With snappy compression on with a sequence file, I/O speed is about 3x slower. However the file size is 3x smaller so it takes about the same time to scan.

With HBase, the results are equivalent (just much slower than a sequence file). Scanning a compressed table is about 3x slower I/O than an uncompressed table, but the table is 3x smaller, so the time to scan is about the same. Scanning an HBase table takes about 3x as long as scanning the sequence file export of the table, either compressed or uncompressed. The sequence file export file size ends up being just barely larger than the table, either compressed or uncompressed

So in sum, compression slows down I/O 3x, but the file is 3x smaller so the time to scan is about the same. Adding in HBase slows things down another 3x. So I'm seeing 9x faster I/O scanning an uncompressed sequence file vs scanning a compressed table.
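
For reference, the sequence file side of the comparison is just a no-op MR job
over the Export output (keys are ImmutableBytesWritable, values are Result).
This isn't the exact code from my test program, just a sketch with a
placeholder input path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SeqFileScan {

  // No-op mapper: key/value types match what the Export tool writes.
  static class NoOpMapper
      extends Mapper<ImmutableBytesWritable, Result, NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
      // discard the row; we only care about read throughput
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "seqfile-scan");
    job.setJarByClass(SeqFileScan.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path("/exports/mytable")); // placeholder path
    job.setMapperClass(NoOpMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    TableMapReduceUtil.addDependencyJars(job);  // ship HBase jars so Result deserializes in the mappers
    job.waitForCompletion(true);
  }
}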


On May 8, 2013, at 10:15 AM, Bryan Keller <br...@gmail.com> wrote:

> Thanks for the offer Lars! I haven't made much progress speeding things up.
> 
> I finally put together a test program that populates a table that is similar to my production dataset. I have a readme that should describe things, hopefully enough to make it useable. There is code to populate a test table, code to scan the table, and code to scan sequence files from an export (to compare HBase w/ raw HDFS). I use a gradle build script.
> 
> You can find the code here:
> 
> https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
> 
> 
> On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> The blockbuffers are not reused, but that by itself should not be a problem as they are all the same size (at least I have never identified that as one in my profiling sessions).
>> 
>> My offer still stands to do some profiling myself if there is an easy way to generate data of similar shape.
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org 
>> Sent: Friday, May 3, 2013 3:44 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> Actually I'm not too confident in my results re block size, they may have been related to major compaction. I'm going to rerun before drawing any conclusions.
>> 
>> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read having a block size this big is not recommended, is that correct?
>>> 
>>> I haven't dug too deeply into the code, are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
>>> 
>>> I'm not using a block encoder. Also, I'm still sifting through the profiler results, I'll see if I can make more sense of it and run some more experiments.
>>> 
>>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
>>>> 
>>>> 
>>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>>>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
>>>> 
>>>> 
>>>> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
>>>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
>>>> 
>>>> 
>>>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org 
>>>> Sent: Thursday, May 2, 2013 10:54 AM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
>>>> 
>>>> Would changing the block size, either HDFS or HBase, help here?
>>>> 
>>>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
>>>> 
>>>> 
>>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>>>> 
>>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>>> 
>>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>>>> 
>>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>>>>> I would be very curious to see profiling data.
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>>> Cc: 
>>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>>>>> 
>>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>>>> 
>>>>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>>>>> 
>>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>>>>>> 
>>>>>>> 
>>>>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>>  From: Bryan Keller <br...@gmail.com>
>>>>>>> To: user@hbase.apache.org
>>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>> 
>>>>>>> 
>>>>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>>>>> 
>>>>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>>>>> 
>>>>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>>>>> 
>>>>>>> I'll also come up with a sample program that generates data similar to my table.
>>>>>>> 
>>>>>>> 
>>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>>>>> 
>>>>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>>> 
>>>>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>>>>> 
>>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>>> 
>>>>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>>> 
>>>>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>>>>>> 
>>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>>>>> 
>>>>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>>>>> 
>>>>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ________________________________
>>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>>> To: user@hbase.apache.org
>>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>>>>> 
>>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>>> 
>>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>>>>> be bad for MapReduce jobs
>>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>>> 
>>>>>>>>> I guess you have used the above setting.
>>>>>>>>> 
>>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>>> 
> 


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Thanks for the offer, Lars! I haven't made much progress speeding things up.

I finally put together a test program that populates a table similar to my production dataset. I have a readme that should describe things, hopefully in enough detail to make it usable. There is code to populate a test table, code to scan the table, and code to scan sequence files from an export (to compare HBase with raw HDFS). I use a Gradle build script.

You can find the code here:

https://dl.dropboxusercontent.com/u/6880177/hbasetest.zip
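
For a sense of the data shape the generator produces (hashed row keys and wide rows with prefix-grouped columns), here is a rough sketch; the table name, family, row count, and column count are illustrative, and the real generator is in the zip above:

import java.security.MessageDigest;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PopulateTestTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    table.setAutoFlush(false);            // batch puts for write throughput
    byte[] cf = Bytes.toBytes("cf");

    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (int row = 0; row < 100000; row++) {
      byte[] rowKey = md5.digest(Bytes.toBytes(row));   // hashed keys spread load across regions
      Put put = new Put(rowKey);
      for (int i = 1; i <= 100; i++) {                  // hundreds of prefix-grouped columns per row
        put.add(cf, Bytes.toBytes(i + "_name"), Bytes.toBytes(UUID.randomUUID().toString()));
        put.add(cf, Bytes.toBytes(i + "_address"), Bytes.toBytes(UUID.randomUUID().toString()));
        put.add(cf, Bytes.toBytes(i + "_city"), Bytes.toBytes(UUID.randomUUID().toString()));
      }
      table.put(put);
    }
    table.flushCommits();
    table.close();
  }
}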


On May 4, 2013, at 6:33 PM, lars hofhansl <la...@apache.org> wrote:

> The blockbuffers are not reused, but that by itself should not be a problem as they are all the same size (at least I have never identified that as one in my profiling sessions).
> 
> My offer still stands to do some profiling myself if there is an easy way to generate data of similar shape.
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org 
> Sent: Friday, May 3, 2013 3:44 AM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> Actually I'm not too confident in my results re block size, they may have been related to major compaction. I'm going to rerun before drawing any conclusions.
> 
> On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:
> 
>> I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read having a block size this big is not recommended, is that correct?
>> 
>> I haven't dug too deeply into the code, are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
>> 
>> I'm not using a block encoder. Also, I'm still sifting through the profiler results, I'll see if I can make more sense of it and run some more experiments.
>> 
>> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
>> 
>>> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
>>> 
>>> 
>>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
>>> 
>>> 
>>> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
>>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
>>> 
>>> 
>>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Bryan Keller <br...@gmail.com>
>>> To: user@hbase.apache.org 
>>> Sent: Thursday, May 2, 2013 10:54 AM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>> 
>>> 
>>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
>>> 
>>> Would changing the block size, either HDFS or HBase, help here?
>>> 
>>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
>>> 
>>> 
>>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>>> 
>>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>>> 
>>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>>> 
>>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>>>> I would be very curious to see profiling data.
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: Bryan Keller <br...@gmail.com>
>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>>> Cc: 
>>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>> 
>>>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>>>> 
>>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>>> 
>>>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>>>> 
>>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>>>>> 
>>>>>> 
>>>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>>   From: Bryan Keller <br...@gmail.com>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> 
>>>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>>>> 
>>>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>>>> 
>>>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>>>> 
>>>>>> I'll also come up with a sample program that generates data similar to my table.
>>>>>> 
>>>>>> 
>>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>>>> 
>>>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>>> 
>>>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>>>> 
>>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>>> 
>>>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>>>> 
>>>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>>>>> 
>>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>>>> 
>>>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>>>> 
>>>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>>> To: user@hbase.apache.org
>>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>>> 
>>>>>>> 
>>>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>>>> 
>>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>>> 
>>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>>>> be bad for MapReduce jobs
>>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>>> 
>>>>>>>> I guess you have used the above setting.
>>>>>>>> 
>>>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>>>> 0.94.7 which was recently released ?
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>>> 


Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
The block buffers are not reused, but that by itself should not be a problem, as they are all the same size (at least I have never identified that as a bottleneck in my profiling sessions).

My offer still stands to do some profiling myself if there is an easy way to generate data of similar shape.

-- Lars



________________________________
 From: Bryan Keller <br...@gmail.com>
To: user@hbase.apache.org 
Sent: Friday, May 3, 2013 3:44 AM
Subject: Re: Poor HBase map-reduce scan performance
 

Actually I'm not too confident in my results re block size, they may have been related to major compaction. I'm going to rerun before drawing any conclusions.

On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:

> I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read having a block size this big is not recommended, is that correct?
> 
> I haven't dug too deeply into the code, are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
> 
> I'm not using a block encoder. Also, I'm still sifting through the profiler results, I'll see if I can make more sense of it and run some more experiments.
> 
> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
>> 
>> 
>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
>> 
>> 
>> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
>> 
>> 
>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org 
>> Sent: Thursday, May 2, 2013 10:54 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
>> 
>> Would changing the block size, either HDFS or HBase, help here?
>> 
>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
>> 
>> 
>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>> 
>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>>> I would be very curious to see profiling data.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>> Cc: 
>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>>> 
>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>> 
>>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>>> 
>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>>>> 
>>>>> 
>>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>>  From: Bryan Keller <br...@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>> 
>>>>> 
>>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>>> 
>>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>>> 
>>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>>> 
>>>>> I'll also come up with a sample program that generates data similar to my table.
>>>>> 
>>>>> 
>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>>> 
>>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>> 
>>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>>> 
>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>> 
>>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>>> 
>>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>>>> 
>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>>> 
>>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>>> 
>>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> 
>>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>>> 
>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>> 
>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>> 
>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>>> be bad for MapReduce jobs
>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>> 
>>>>>>> I guess you have used the above setting.
>>>>>>> 
>>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>>> 0.94.7 which was recently released ?
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>> 
> 

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Actually, I'm not too confident in my results regarding block size; they may have been related to major compaction. I'm going to rerun before drawing any conclusions.
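
One way to take compaction out of the picture when rerunning is to force a major compaction and let it finish before timing the scan. A minimal sketch, assuming the 0.94 admin API and an illustrative table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactBeforeBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // The request is asynchronous; watch the region server UI or metrics until compaction completes.
    admin.majorCompact("testtable");
    admin.close();
  }
}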

On May 3, 2013, at 12:17 AM, Bryan Keller <br...@gmail.com> wrote:

> I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read having a block size this big is not recommended, is that correct?
> 
> I haven't dug too deeply into the code, are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
> 
> I'm not using a block encoder. Also, I'm still sifting through the profiler results, I'll see if I can make more sense of it and run some more experiments.
> 
> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
>> 
>> 
>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
>> 
>> 
>> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
>> 
>> 
>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org 
>> Sent: Thursday, May 2, 2013 10:54 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
>> 
>> Would changing the block size, either HDFS or HBase, help here?
>> 
>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
>> 
>> 
>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>> 
>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>>> I would be very curious to see profiling data.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>> Cc: 
>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>>> 
>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>> 
>>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>>> 
>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>>>> 
>>>>> 
>>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>>  From: Bryan Keller <br...@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>> 
>>>>> 
>>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>>> 
>>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>>> 
>>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>>> 
>>>>> I'll also come up with a sample program that generates data similar to my table.
>>>>> 
>>>>> 
>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>>> 
>>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>> 
>>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>>> 
>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>> 
>>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>>> 
>>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>>>> 
>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>>> 
>>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>>> 
>>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> 
>>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>>> 
>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>> 
>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>> 
>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>>> be bad for MapReduce jobs
>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>> 
>>>>>>> I guess you have used the above setting.
>>>>>>> 
>>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>>> 0.94.7 which was recently released ?
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>> 
> 


Re: Poor HBase map-reduce scan performance

Posted by Michael Segel <mi...@hotmail.com>.
You really don't want to mess around with the block size.

Sure, larger blocks are better for sequential scans, but the minute you do a lot of random ad hoc fetches, you're kinda screwed. 
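
A quick way to see that tradeoff is to time a batch of point gets against the same table at each block size: every get that misses the block cache has to read and decompress an entire block just to return a single row. A rough sketch (the table name, key scheme, and counts are assumptions, not from the actual test setup):

import java.security.MessageDigest;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    Random rand = new Random();

    long start = System.currentTimeMillis();
    int hits = 0;
    for (int i = 0; i < 10000; i++) {
      // Assumes row keys are MD5 hashes of sequential ints, as in the test data generator;
      // substitute whatever key scheme the table actually uses.
      byte[] rowKey = md5.digest(Bytes.toBytes(rand.nextInt(100000)));
      Result r = table.get(new Get(rowKey));
      if (!r.isEmpty()) {
        hits++;
      }
    }
    System.out.println("10000 gets (" + hits + " non-empty) took "
        + (System.currentTimeMillis() - start) + " ms");
    table.close();
  }
}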


On May 3, 2013, at 2:17 AM, Bryan Keller <br...@gmail.com> wrote:

> I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great but much better. Before I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big. I care about random read performance too. I've read having a block size this big is not recommended, is that correct?
> 
> I haven't dug too deeply into the code, are the block buffers reused or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.
> 
> I'm not using a block encoder. Also, I'm still sifting through the profiler results, I'll see if I can make more sense of it and run some more experiments.
> 
> On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
>> 
>> 
>> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
>> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
>> 
>> 
>> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
>> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
>> 
>> 
>> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org 
>> Sent: Thursday, May 2, 2013 10:54 AM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
>> 
>> Would changing the block size, either HDFS or HBase, help here?
>> 
>> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
>> 
>> 
>> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I used exactly 0.94.4, pulled from the tag in subversion.
>>> 
>>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>>> I would be very curious to see profiling data.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>>> Cc: 
>>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>>> 
>>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>>> 
>>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>>> 
>>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>>>> 
>>>>> 
>>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>>  From: Bryan Keller <br...@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>> 
>>>>> 
>>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>>> 
>>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>>> 
>>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>>> 
>>>>> I'll also come up with a sample program that generates data similar to my table.
>>>>> 
>>>>> 
>>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>>> 
>>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>>> 
>>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>>> 
>>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>>> 
>>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>>> 
>>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>>>> 
>>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>>> 
>>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>>> 
>>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Bryan Keller <br...@gmail.com>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>>> 
>>>>>> 
>>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>>> 
>>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>> 
>>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>>> 
>>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>>> be bad for MapReduce jobs
>>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>>> 
>>>>>>> I guess you have used the above setting.
>>>>>>> 
>>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>>> 0.94.7 which was recently released ?
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>>> 
> 
> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I finally made some progress. I tried a very large HBase block size (16mb), and it significantly improved scan performance. I went from 45-50 min to 24 min. Not great, but much better. Before, I had it set to 128k. Scanning an equivalent sequence file takes 10 min. My random read performance will probably suffer with such a large block size (theoretically), so I probably can't keep it this big; I care about random read performance too. I've read that having a block size this big is not recommended, is that correct?

I haven't dug too deeply into the code; are the block buffers reused, or is each new block read a new allocation? Perhaps a buffer pool could help here if there isn't one already. When doing a scan, HBase could reuse previously allocated block buffers instead of allocating a new one for each block. Then block size shouldn't affect scan performance much.

I'm not using a block encoder. Also, I'm still sifting through the profiler results; I'll see if I can make more sense of them and run some more experiments.
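
For anyone reproducing the block-size change: it is a per-column-family schema setting, and on 0.94 the table has to be disabled for the alteration; the new size only applies to HFiles written afterwards (e.g. after a major compaction). A sketch with illustrative table and family names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SetBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("testtable"));
    HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("cf"));
    cf.setBlocksize(16 * 1024 * 1024);   // 16 MB, vs. the 64 KB default

    admin.disableTable("testtable");
    admin.modifyColumn("testtable", cf);   // push the changed family descriptor
    admin.enableTable("testtable");
    admin.close();
  }
}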

On May 2, 2013, at 5:46 PM, lars hofhansl <la...@apache.org> wrote:

> Interesting. If you can try 0.94.7 (but it'll probably not have changed that much from 0.94.4)
> 
> 
> Do you have enabled one of the block encoders (FAST_DIFF, etc)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
> (Sine you see ScannerV2 rather than EncodedScannerV2 you probably have not enabled encoding, just checking).
> 
> 
> And do you have a stack trace for the ByteBuffer.allocate(). That is a strange one since it never came up in my profiling (unless you enabled block encoding).
> (You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).
> 
> 
> During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org 
> Sent: Thursday, May 2, 2013 10:54 AM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.
> 
> Would changing the block size, either HDFS or HBase, help here?
> 
> Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.
> 
> 
> On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:
> 
>> I used exactly 0.94.4, pulled from the tag in subversion.
>> 
>> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
>> 
>>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>>> I would be very curious to see profiling data.
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>> From: Bryan Keller <br...@gmail.com>
>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>> Cc: 
>>> Sent: Wednesday, May 1, 2013 6:01 PM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>> 
>>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>>> 
>>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>>> 
>>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>>> 
>>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>>> 
>>>> 
>>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>>   From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>>> 
>>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>>> 
>>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>>> 
>>>> I'll also come up with a sample program that generates data similar to my table.
>>>> 
>>>> 
>>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>>> 
>>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>>> 
>>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>>> 
>>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>>> 
>>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>>> 
>>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>>> 
>>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>>> 
>>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>>> 
>>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Bryan Keller <br...@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>>> 
>>>>> 
>>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>>> 
>>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>> 
>>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>>> 
>>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>>> be bad for MapReduce jobs
>>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>>> 
>>>>>> I guess you have used the above setting.
>>>>>> 
>>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>>> 0.94.7 which was recently released ?
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>>> 


Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
Interesting. If you can, try 0.94.7 (though it'll probably not have changed that much from 0.94.4).


Have you enabled one of the block encoders (FAST_DIFF, etc.)? If so, try without. They currently need to reallocate a ByteBuffer for each single KV.
(Since you see ScannerV2 rather than EncodedScannerV2, you probably have not enabled encoding; just checking.)
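
A quick check of whether any encoder is configured on the scanned table is to print each family's setting; NONE means encoding is off. A minimal sketch, assuming the 0.94 admin API and an illustrative table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckBlockEncoding {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("testtable"));
    for (HColumnDescriptor cf : desc.getColumnFamilies()) {
      // NONE means no data block encoder is configured for this family.
      System.out.println(cf.getNameAsString() + ": " + cf.getDataBlockEncoding());
    }
    admin.close();
  }
}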


And do you have a stack trace for the ByteBuffer.allocate()? That is a strange one, since it never came up in my profiling (unless you enabled block encoding).
(You can get traces from VisualVM by creating a snapshot, but you'd have to drill in to find the allocate()).


During normal scanning (again, without encoding) there should be no allocation happening except for blocks read from disk (and they should all be the same size, thus allocation should be cheap).

-- Lars



________________________________
 From: Bryan Keller <br...@gmail.com>
To: user@hbase.apache.org 
Sent: Thursday, May 2, 2013 10:54 AM
Subject: Re: Poor HBase map-reduce scan performance
 

I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). It appears at first glance that memory allocations may be an issue. Decompression was next below that but less of an issue it seems.

Would changing the block size, either HDFS or HBase, help here?

Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.


On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:

> I used exactly 0.94.4, pulled from the tag in subversion.
> 
> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>> I would be very curious to see profiling data.
>> 
>> -- Lars
>> 
>> 
>> 
>> ----- Original Message -----
>> From: Bryan Keller <br...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Cc: 
>> Sent: Wednesday, May 1, 2013 6:01 PM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>> 
>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>> 
>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>> 
>>> 
>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ________________________________
>>>  From: Bryan Keller <br...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>> 
>>> 
>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>> 
>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>> 
>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>> 
>>> I'll also come up with a sample program that generates data similar to my table.
>>> 
>>> 
>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>> 
>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>> 
>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>> 
>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>> 
>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>> 
>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>> 
>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>> 
>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>> 
>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> 
>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>> 
>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>> be bad for MapReduce jobs
>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>> 
>>>>> I guess you have used the above setting.
>>>>> 
>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>> 0.94.7 which was recently released ?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>> 
> 

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I ran one of my regionservers through VisualVM. It looks like the top hot spots are HFileReaderV2$ScannerV2.getKeyValue() and ByteBuffer.allocate(). At first glance it appears that memory allocation may be an issue. Decompression was next below that, but it seems to be less of an issue.

Would changing the block size, either HDFS or HBase, help here?
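
If I do experiment with the HBase block size, my understanding is that it is a per-column-family setting, so it would be roughly something like this (an untested sketch against the 0.94 client API, with the table/family names and the new size made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class ChangeBlockSize {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    // Hypothetical table "mytable" and family "cf" -- substitute real names.
    HColumnDescriptor cf =
        admin.getTableDescriptor(Bytes.toBytes("mytable")).getFamily(Bytes.toBytes("cf"));
    cf.setBlocksize(256 * 1024);  // arbitrary larger block size to experiment with
    admin.disableTable("mytable");
    admin.modifyColumn("mytable", cf);
    admin.enableTable("mytable");
    admin.close();
    // Existing HFiles keep their old block size until a (major) compaction
    // rewrites them, so compact before comparing scan times.
  }
}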

Also, if anyone has tips on how else to profile, that would be appreciated. VisualVM can produce a lot of noise that is hard to sift through.


On May 1, 2013, at 9:49 PM, Bryan Keller <br...@gmail.com> wrote:

> I used exactly 0.94.4, pulled from the tag in subversion.
> 
> On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
>> I would be very curious to see profiling data.
>> 
>> -- Lars
>> 
>> 
>> 
>> ----- Original Message -----
>> From: Bryan Keller <br...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> Cc: 
>> Sent: Wednesday, May 1, 2013 6:01 PM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
>> 
>> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>>> 
>>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>>> 
>>> 
>>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ________________________________
>>>  From: Bryan Keller <br...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>> 
>>> 
>>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>>> 
>>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>>> 
>>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>>> 
>>> I'll also come up with a sample program that generates data similar to my table.
>>> 
>>> 
>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>>> 
>>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>>> 
>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>> 
>>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>>> 
>>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>>> 
>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>>> 
>>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>>> 
>>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>>> 
>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> 
>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>> 
>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>> be bad for MapReduce jobs
>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>> 
>>>>> I guess you have used the above setting.
>>>>> 
>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>> 0.94.7 which was recently released ?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
>> 
> 


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I used exactly 0.94.4, pulled from the tag in Subversion.

On May 1, 2013, at 9:41 PM, lars hofhansl <la...@apache.org> wrote:

> Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7.
> I would be very curious to see profiling data.
> 
> -- Lars
> 
> 
> 
> ----- Original Message -----
> From: Bryan Keller <br...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc: 
> Sent: Wednesday, May 1, 2013 6:01 PM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.
> 
> On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:
> 
>> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
>> 
>> On Tuesday, April 30, 2013, lars hofhansl wrote:
>> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
>> 
>> 
>> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>>   From: Bryan Keller <br...@gmail.com>
>> To: user@hbase.apache.org
>> Sent: Tuesday, April 30, 2013 11:02 PM
>> Subject: Re: Poor HBase map-reduce scan performance
>> 
>> 
>> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
>> 
>> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
>> 
>> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
>> 
>> I'll also come up with a sample program that generates data similar to my table.
>> 
>> 
>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>> 
>>> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
>>> 
>>> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
>>> 
>>> A bunch of scan improvements went into HBase since 0.94.2.
>>> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>> 
>>> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
>>> 
>>> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
>>> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
>>> 
>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
>>> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
>>> 
>>> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
>>> 
>>> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
>>> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Bryan Keller <br...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>> 
>>> 
>>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>>> 
>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>> 
>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>> be bad for MapReduce jobs
>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>> 
>>>> I guess you have used the above setting.
>>>> 
>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>> 0.94.7 which was recently released ?
>>>> 
>>>> Cheers
>>>> 
>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm
> 


Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
Hmm... Did you actually use exactly version 0.94.4, or the latest 0.94.7?
I would be very curious to see profiling data.

-- Lars



----- Original Message -----
From: Bryan Keller <br...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Wednesday, May 1, 2013 6:01 PM
Subject: Re: Poor HBase map-reduce scan performance

I tried running my test with 0.94.4, unfortunately performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow and will report back.

On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:

> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
> 
> On Tuesday, April 30, 2013, lars hofhansl wrote:
> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
> 
> 
> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
> 
> -- Lars
> 
> 
> 
> ________________________________
>  From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org
> Sent: Tuesday, April 30, 2013 11:02 PM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
> 
> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
> 
> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
> 
> I'll also come up with a sample program that generates data similar to my table.
> 
> 
> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> 
> > Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
> >
> > I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
> >
> > A bunch of scan improvements went into HBase since 0.94.2.
> > I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >
> > In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
> >
> > In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
> > So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
> >
> > And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
> > With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
> >
> > At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
> >
> > If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
> > (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
> >
> > -- Lars
> >
> >
> >
> >
> > ________________________________
> > From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, April 30, 2013 9:31 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
> >
> > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> >From http://hbase.apache.org/book.html#mapreduce.example :
> >>
> >> scan.setCaching(500);        // 1 is the default in Scan, which will
> >> be bad for MapReduce jobs
> >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>
> >> I guess you have used the above setting.
> >>
> >> 0.94.x releases are compatible. Have you considered upgrading to, say
> >> 0.94.7 which was recently released ?
> >>
> >> Cheers
> >>
> >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
I tried running my test with 0.94.4; unfortunately, performance was about the same. I'm planning on profiling the regionserver and trying some other things tonight and tomorrow, and will report back.

On May 1, 2013, at 8:00 AM, Bryan Keller <br...@gmail.com> wrote:

> Yes I would like to try this, if you can point me to the pom.xml patch that would save me some time.
> 
> On Tuesday, April 30, 2013, lars hofhansl wrote:
> If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turns makes scans scale better across cores, since RAM is shared resource between cores (much like disk).
> 
> 
> It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.
> 
> -- Lars
> 
> 
> 
> ________________________________
>  From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org
> Sent: Tuesday, April 30, 2013 11:02 PM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).
> 
> Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
> 
> Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.
> 
> I'll also come up with a sample program that generates data similar to my table.
> 
> 
> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> 
> > Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
> >
> > I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
> >
> > A bunch of scan improvements went into HBase since 0.94.2.
> > I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >
> > In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
> >
> > In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
> > So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
> >
> > And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
> > With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
> >
> > At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
> >
> > If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
> > (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
> >
> > -- Lars
> >
> >
> >
> >
> > ________________________________
> > From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, April 30, 2013 9:31 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
> >
> > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> From http://hbase.apache.org/book.html#mapreduce.example :
> >>
> >> scan.setCaching(500);        // 1 is the default in Scan, which will
> >> be bad for MapReduce jobs
> >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>
> >> I guess you have used the above setting.
> >>
> >> 0.94.x releases are compatible. Have you considered upgrading to, say
> >> 0.94.7 which was recently released ?
> >>
> >> Cheers
> >>
> >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm


Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Yes, I would like to try this. If you can point me to the pom.xml patch, that
would save me some time.

On Tuesday, April 30, 2013, lars hofhansl wrote:

> If you can, try 0.94.4+; it should significantly reduce the amount of
> bytes copied around in RAM during scanning, especially if you have wide
> rows and/or large key portions. That in turns makes scans scale better
> across cores, since RAM is shared resource between cores (much like disk).
>
>
> It's not hard to build the latest HBase against Cloudera's version of
> Hadoop. I can send along a simple patch to pom.xml to do that.
>
> -- Lars
>
>
>
> ________________________________
>  From: Bryan Keller <bryanck@gmail.com <javascript:;>>
> To: user@hbase.apache.org <javascript:;>
> Sent: Tuesday, April 30, 2013 11:02 PM
> Subject: Re: Poor HBase map-reduce scan performance
>
>
> The table has hashed keys so rows are evenly distributed amongst the
> regionservers, and load on each regionserver is pretty much the same. I
> also have per-table balancing turned on. I get mostly data local mappers
> with only a few rack local (maybe 10 of the 250 mappers).
>
> Currently the table is a wide table schema, with lists of data structures
> stored as columns with column prefixes grouping the data structures (e.g.
> 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of
> moving those data structures to protobuf which would cut down on the number
> of columns. The downside is I can't filter on one value with that, but it
> is a tradeoff I would make for performance. I was also considering
> restructuring the table into a tall table.
>
> Something interesting is that my old regionserver machines had five 15k
> SCSI drives instead of 2 SSDs, and performance was about the same. Also, my
> old network was 1gbit, now it is 10gbit. So neither network nor disk I/O
> appear to be the bottleneck. The CPU is rather high for the regionserver so
> it seems like the best candidate to investigate. I will try profiling it
> tomorrow and will report back. I may revisit compression on vs off since
> that is adding load to the CPU.
>
> I'll also come up with a sample program that generates data similar to my
> table.
>
>
> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Your average row is 35k so scanner caching would not make a huge
> difference, although I would have expected some improvements by setting it
> to 10 or 50 since you have a wide 10ge pipe.
> >
> > I assume your table is split sufficiently to touch all RegionServer...
> Do you see the same load/IO on all region servers?
> >
> > A bunch of scan improvements went into HBase since 0.94.2.
> > I blogged about some of these changes here:
> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >
> > In your case - since you have many columns, each of which carry the
> rowkey - you might benefit a lot from HBASE-7279.
> >
> > In the end HBase *is* slower than straight HDFS for full scans. How
> could it not be?
> > So I would start by looking at HDFS first. Make sure Nagle's is disbaled
> in both HBase and HDFS.
> >
> > And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell
> is listening, I think he did some tests with HBase on SSDs.
> > With rotating media you typically see an improvement with compression.
> With SSDs the added CPU needed for decompression might outweigh the
> benefits.
> >
> > At the risk of starting a larger discussion here, I would posit that
> HBase's LSM based design, which trades random IO with sequential IO, might
> be a bit more questionable on SSDs.
> >
> > If you can, it would be nice to run a profiler against one of the
> RegionServers (or maybe do it with the single RS setup) and see where it is
> bottlenecked.
> > (And if you send me a sample program to generate some data - not 700g,
> though :) - I'll try to do a bit of profiling during the next days as my
> day job permits, but I do not have any machines with SSDs).
> >
> > -- Lars
> >
> >
> >
> >
> > ________________________________
> > From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, April 30, 2013 9:31 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > Yes, I have tried various settings for setCaching() and I have
> setCacheBlocks(false)
> >
> > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> From http://hbase.apache.org/book.html#mapreduce.example :
> >>
> >> scan.setCaching(500);        // 1 is the default in Scan, which will
> >> be bad for MapReduce jobs
> >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>
> >> I guess you have used the above setting.
> >>
> >> 0.94.x releases are compatible. Have you considered upgrading to, say
> >> 0.94.7 which was recently released ?
> >>
> >> Cheers
> >>
> >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <bryanck@gm

Re: Poor HBase map-reduce scan performance

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Sorry, I think someone hijacked this thread and I replied to it.
Naidu,
please post a new thread if you have queries, and do not hijack this
thread.

Regards
Ram


On Wed, May 1, 2013 at 12:57 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> This happens when your java process is running in debug mode and
> suspend='Y' option is selected.
>
> Regards
> Ram
>
>
> On Wed, May 1, 2013 at 12:55 PM, Naidu MS <sanyasinaidu.malla433@gmail.com
> > wrote:
>
>> Hi i have two questions regarding hdfs and jps utility
>>
>> I am new to Hadoop and started leraning hadoop from the past week
>>
>> 1.when ever i start start-all.sh and jps in console it showing the
>> processes started
>>
>> *naidu@naidu:~/work/hadoop-1.0.4/bin$ jps*
>> *22283 NameNode*
>> *23516 TaskTracker*
>> *26711 Jps*
>> *22541 DataNode*
>> *23255 JobTracker*
>> *22813 SecondaryNameNode*
>> *Could not synchronize with target*
>>
>> But along with the list of process stared it always showing *" Could not
>> synchronize with target" *in the jps output. What is meant by "Could not
>> synchronize with target"?  Can some one explain why this is happening?
>>
>>
>> 2.Is it possible to format namenode multiple  times? When i enter the
>>  namenode -format command, it not formatting the name node and showing the
>> following ouput.
>>
>> *naidu@naidu:~/work/hadoop-1.0.4/bin$ hadoop namenode -format*
>> *Warning: $HADOOP_HOME is deprecated.*
>> *
>> *
>> *13/05/01 12:08:04 INFO namenode.NameNode: STARTUP_MSG: *
>> */*************************************************************
>> *STARTUP_MSG: Starting NameNode*
>> *STARTUP_MSG:   host = naidu/127.0.0.1*
>> *STARTUP_MSG:   args = [-format]*
>> *STARTUP_MSG:   version = 1.0.4*
>> *STARTUP_MSG:   build =
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
>> 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012*
>> *************************************************************/*
>> *Re-format filesystem in /home/naidu/dfs/namenode ? (Y or N) y*
>> *Format aborted in /home/naidu/dfs/namenode*
>> *13/05/01 12:08:05 INFO namenode.NameNode: SHUTDOWN_MSG: *
>> */*************************************************************
>> *SHUTDOWN_MSG: Shutting down NameNode at naidu/127.0.0.1*
>> *
>> *
>> *************************************************************/*
>>
>> Can someone help me in understanding this? Why is it not possible to
>> format
>> name node multiple times?
>>
>>
>> On Wed, May 1, 2013 at 12:22 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>
>> > Not that it's a long-term solution, but try major-compacting before
>> running
>> > the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
>> > through the PriorityQueue, then reducing to a single file per region
>> should
>> > help.  The merging of HFiles during a scan is not heavily optimized yet.
>> >
>> >
>> > On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <la...@apache.org>
>> wrote:
>> >
>> > > If you can, try 0.94.4+; it should significantly reduce the amount of
>> > > bytes copied around in RAM during scanning, especially if you have
>> wide
>> > > rows and/or large key portions. That in turns makes scans scale better
>> > > across cores, since RAM is shared resource between cores (much like
>> > disk).
>> > >
>> > >
>> > > It's not hard to build the latest HBase against Cloudera's version of
>> > > Hadoop. I can send along a simple patch to pom.xml to do that.
>> > >
>> > > -- Lars
>> > >
>> > >
>> > >
>> > > ________________________________
>> > >  From: Bryan Keller <br...@gmail.com>
>> > > To: user@hbase.apache.org
>> > > Sent: Tuesday, April 30, 2013 11:02 PM
>> > > Subject: Re: Poor HBase map-reduce scan performance
>> > >
>> > >
>> > > The table has hashed keys so rows are evenly distributed amongst the
>> > > regionservers, and load on each regionserver is pretty much the same.
>> I
>> > > also have per-table balancing turned on. I get mostly data local
>> mappers
>> > > with only a few rack local (maybe 10 of the 250 mappers).
>> > >
>> > > Currently the table is a wide table schema, with lists of data
>> structures
>> > > stored as columns with column prefixes grouping the data structures
>> (e.g.
>> > > 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking
>> of
>> > > moving those data structures to protobuf which would cut down on the
>> > number
>> > > of columns. The downside is I can't filter on one value with that,
>> but it
>> > > is a tradeoff I would make for performance. I was also considering
>> > > restructuring the table into a tall table.
>> > >
>> > > Something interesting is that my old regionserver machines had five
>> 15k
>> > > SCSI drives instead of 2 SSDs, and performance was about the same.
>> Also,
>> > my
>> > > old network was 1gbit, now it is 10gbit. So neither network nor disk
>> I/O
>> > > appear to be the bottleneck. The CPU is rather high for the
>> regionserver
>> > so
>> > > it seems like the best candidate to investigate. I will try profiling
>> it
>> > > tomorrow and will report back. I may revisit compression on vs off
>> since
>> > > that is adding load to the CPU.
>> > >
>> > > I'll also come up with a sample program that generates data similar
>> to my
>> > > table.
>> > >
>> > >
>> > > On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>> > >
>> > > > Your average row is 35k so scanner caching would not make a huge
>> > > difference, although I would have expected some improvements by
>> setting
>> > it
>> > > to 10 or 50 since you have a wide 10ge pipe.
>> > > >
>> > > > I assume your table is split sufficiently to touch all
>> RegionServer...
>> > > Do you see the same load/IO on all region servers?
>> > > >
>> > > > A bunch of scan improvements went into HBase since 0.94.2.
>> > > > I blogged about some of these changes here:
>> > > http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>> > > >
>> > > > In your case - since you have many columns, each of which carry the
>> > > rowkey - you might benefit a lot from HBASE-7279.
>> > > >
>> > > > In the end HBase *is* slower than straight HDFS for full scans. How
>> > > could it not be?
>> > > > So I would start by looking at HDFS first. Make sure Nagle's is
>> > disbaled
>> > > in both HBase and HDFS.
>> > > >
>> > > > And lastly SSDs are somewhat new territory for HBase. Maybe Andy
>> > Purtell
>> > > is listening, I think he did some tests with HBase on SSDs.
>> > > > With rotating media you typically see an improvement with
>> compression.
>> > > With SSDs the added CPU needed for decompression might outweigh the
>> > > benefits.
>> > > >
>> > > > At the risk of starting a larger discussion here, I would posit that
>> > > HBase's LSM based design, which trades random IO with sequential IO,
>> > might
>> > > be a bit more questionable on SSDs.
>> > > >
>> > > > If you can, it would be nice to run a profiler against one of the
>> > > RegionServers (or maybe do it with the single RS setup) and see where
>> it
>> > is
>> > > bottlenecked.
>> > > > (And if you send me a sample program to generate some data - not
>> 700g,
>> > > though :) - I'll try to do a bit of profiling during the next days as
>> my
>> > > day job permits, but I do not have any machines with SSDs).
>> > > >
>> > > > -- Lars
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > ________________________________
>> > > > From: Bryan Keller <br...@gmail.com>
>> > > > To: user@hbase.apache.org
>> > > > Sent: Tuesday, April 30, 2013 9:31 PM
>> > > > Subject: Re: Poor HBase map-reduce scan performance
>> > > >
>> > > >
>> > > > Yes, I have tried various settings for setCaching() and I have
>> > > setCacheBlocks(false)
>> > > >
>> > > > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>> > > >
>> > > >> From http://hbase.apache.org/book.html#mapreduce.example :
>> > > >>
>> > > >> scan.setCaching(500);        // 1 is the default in Scan, which
>> will
>> > > >> be bad for MapReduce jobs
>> > > >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>> > > >>
>> > > >> I guess you have used the above setting.
>> > > >>
>> > > >> 0.94.x releases are compatible. Have you considered upgrading to,
>> say
>> > > >> 0.94.7 which was recently released ?
>> > > >>
>> > > >> Cheers
>> > > >>
>> > > >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com>
>> > > wrote:
>> > > >>
>> > > >>> I have been attempting to speed up my HBase map-reduce scans for a
>> > > while
>> > > >>> now. I have tried just about everything without much luck. I'm
>> > running
>> > > out
>> > > >>> of ideas and was hoping for some suggestions. This is HBase 0.94.2
>> > and
>> > > >>> Hadoop 2.0.0 (CDH4.2.1).
>> > > >>>
>> > > >>> The table I'm scanning:
>> > > >>> 20 mil rows
>> > > >>> Hundreds of columns/row
>> > > >>> Column keys can be 30-40 bytes
>> > > >>> Column values are generally not large, 1k would be on the large
>> side
>> > > >>> 250 regions
>> > > >>> Snappy compression
>> > > >>> 8gb region size
>> > > >>> 512mb memstore flush
>> > > >>> 128k block size
>> > > >>> 700gb of data on HDFS
>> > > >>>
>> > > >>> My cluster has 8 datanodes which are also regionservers. Each has
>> 8
>> > > cores
>> > > >>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a
>> > separate
>> > > >>> machine acting as namenode, HMaster, and zookeeper (single
>> > instance). I
>> > > >>> have disk local reads turned on.
>> > > >>>
>> > > >>> I'm seeing around 5 gbit/sec on average network IO. Each disk is
>> > > getting
>> > > >>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 =
>> > > 6.4gb/sec.
>> > > >>>
>> > > >>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read
>> > speed.
>> > > Not
>> > > >>> really that great compared to the theoretical I/O. However this is
>> > far
>> > > >>> better than I am seeing with HBase map-reduce scans of my table.
>> > > >>>
>> > > >>> I have a simple no-op map-only job (using TableInputFormat) that
>> > scans
>> > > the
>> > > >>> table and does nothing with data. This takes 45 minutes. That's
>> about
>> > > >>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>> > > >>> Basically, with HBase I'm seeing read performance of my 16 SSD
>> > cluster
>> > > >>> performing nearly 35% slower than a single SSD.
>> > > >>>
>> > > >>> Here are some things I have changed to no avail:
>> > > >>> Scan caching values
>> > > >>> HDFS block sizes
>> > > >>> HBase block sizes
>> > > >>> Region file sizes
>> > > >>> Memory settings
>> > > >>> GC settings
>> > > >>> Number of mappers/node
>> > > >>> Compressed vs not compressed
>> > > >>>
>> > > >>> One thing I notice is that the regionserver is using quite a bit
>> of
>> > CPU
>> > > >>> during the map reduce job. When dumping the jstack of the
>> process, it
>> > > seems
>> > > >>> like it is usually in some type of memory allocation or
>> decompression
>> > > >>> routine which didn't seem abnormal.
>> > > >>>
>> > > >>> I can't seem to pinpoint the bottleneck. CPU use by the
>> regionserver
>> > is
>> > > >>> high but not maxed out. Disk I/O and network I/O are low, IO wait
>> is
>> > > low.
>> > > >>> I'm on the verge of just writing the dataset out to sequence files
>> > > once a
>> > > >>> day for scan purposes. Is that what others are doing?
>> > >
>> >
>>
>
>

Re: Poor HBase map-reduce scan performance

Posted by ramkrishna vasudevan <ra...@gmail.com>.
This happens when your Java process is running in debug mode and the
suspend=y option is selected.
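(For example, a JVM launched with a debug agent along the lines of
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000 sits
suspended at startup until a debugger attaches, which is when jps reports
it that way.)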

Regards
Ram


On Wed, May 1, 2013 at 12:55 PM, Naidu MS
<sa...@gmail.com>wrote:

> Hi i have two questions regarding hdfs and jps utility
>
> I am new to Hadoop and started leraning hadoop from the past week
>
> 1.when ever i start start-all.sh and jps in console it showing the
> processes started
>
> *naidu@naidu:~/work/hadoop-1.0.4/bin$ jps*
> *22283 NameNode*
> *23516 TaskTracker*
> *26711 Jps*
> *22541 DataNode*
> *23255 JobTracker*
> *22813 SecondaryNameNode*
> *Could not synchronize with target*
>
> But along with the list of process stared it always showing *" Could not
> synchronize with target" *in the jps output. What is meant by "Could not
> synchronize with target"?  Can some one explain why this is happening?
>
>
> 2.Is it possible to format namenode multiple  times? When i enter the
>  namenode -format command, it not formatting the name node and showing the
> following ouput.
>
> *naidu@naidu:~/work/hadoop-1.0.4/bin$ hadoop namenode -format*
> *Warning: $HADOOP_HOME is deprecated.*
> *
> *
> *13/05/01 12:08:04 INFO namenode.NameNode: STARTUP_MSG: *
> */*************************************************************
> *STARTUP_MSG: Starting NameNode*
> *STARTUP_MSG:   host = naidu/127.0.0.1*
> *STARTUP_MSG:   args = [-format]*
> *STARTUP_MSG:   version = 1.0.4*
> *STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
> 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012*
> *************************************************************/*
> *Re-format filesystem in /home/naidu/dfs/namenode ? (Y or N) y*
> *Format aborted in /home/naidu/dfs/namenode*
> *13/05/01 12:08:05 INFO namenode.NameNode: SHUTDOWN_MSG: *
> */*************************************************************
> *SHUTDOWN_MSG: Shutting down NameNode at naidu/127.0.0.1*
> *
> *
> *************************************************************/*
>
> Can someone help me in understanding this? Why is it not possible to format
> name node multiple times?
>
>
> On Wed, May 1, 2013 at 12:22 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > Not that it's a long-term solution, but try major-compacting before
> running
> > the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
> > through the PriorityQueue, then reducing to a single file per region
> should
> > help.  The merging of HFiles during a scan is not heavily optimized yet.
> >
> >
> > On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <la...@apache.org>
> wrote:
> >
> > > If you can, try 0.94.4+; it should significantly reduce the amount of
> > > bytes copied around in RAM during scanning, especially if you have wide
> > > rows and/or large key portions. That in turns makes scans scale better
> > > across cores, since RAM is shared resource between cores (much like
> > disk).
> > >
> > >
> > > It's not hard to build the latest HBase against Cloudera's version of
> > > Hadoop. I can send along a simple patch to pom.xml to do that.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: Bryan Keller <br...@gmail.com>
> > > To: user@hbase.apache.org
> > > Sent: Tuesday, April 30, 2013 11:02 PM
> > > Subject: Re: Poor HBase map-reduce scan performance
> > >
> > >
> > > The table has hashed keys so rows are evenly distributed amongst the
> > > regionservers, and load on each regionserver is pretty much the same. I
> > > also have per-table balancing turned on. I get mostly data local
> mappers
> > > with only a few rack local (maybe 10 of the 250 mappers).
> > >
> > > Currently the table is a wide table schema, with lists of data
> structures
> > > stored as columns with column prefixes grouping the data structures
> (e.g.
> > > 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking
> of
> > > moving those data structures to protobuf which would cut down on the
> > number
> > > of columns. The downside is I can't filter on one value with that, but
> it
> > > is a tradeoff I would make for performance. I was also considering
> > > restructuring the table into a tall table.
> > >
> > > Something interesting is that my old regionserver machines had five 15k
> > > SCSI drives instead of 2 SSDs, and performance was about the same.
> Also,
> > my
> > > old network was 1gbit, now it is 10gbit. So neither network nor disk
> I/O
> > > appear to be the bottleneck. The CPU is rather high for the
> regionserver
> > so
> > > it seems like the best candidate to investigate. I will try profiling
> it
> > > tomorrow and will report back. I may revisit compression on vs off
> since
> > > that is adding load to the CPU.
> > >
> > > I'll also come up with a sample program that generates data similar to
> my
> > > table.
> > >
> > >
> > > On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> > >
> > > > Your average row is 35k so scanner caching would not make a huge
> > > difference, although I would have expected some improvements by setting
> > it
> > > to 10 or 50 since you have a wide 10ge pipe.
> > > >
> > > > I assume your table is split sufficiently to touch all
> RegionServer...
> > > Do you see the same load/IO on all region servers?
> > > >
> > > > A bunch of scan improvements went into HBase since 0.94.2.
> > > > I blogged about some of these changes here:
> > > http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> > > >
> > > > In your case - since you have many columns, each of which carry the
> > > rowkey - you might benefit a lot from HBASE-7279.
> > > >
> > > > In the end HBase *is* slower than straight HDFS for full scans. How
> > > could it not be?
> > > > So I would start by looking at HDFS first. Make sure Nagle's is
> > disbaled
> > > in both HBase and HDFS.
> > > >
> > > > And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> > Purtell
> > > is listening, I think he did some tests with HBase on SSDs.
> > > > With rotating media you typically see an improvement with
> compression.
> > > With SSDs the added CPU needed for decompression might outweigh the
> > > benefits.
> > > >
> > > > At the risk of starting a larger discussion here, I would posit that
> > > HBase's LSM based design, which trades random IO with sequential IO,
> > might
> > > be a bit more questionable on SSDs.
> > > >
> > > > If you can, it would be nice to run a profiler against one of the
> > > RegionServers (or maybe do it with the single RS setup) and see where
> it
> > is
> > > bottlenecked.
> > > > (And if you send me a sample program to generate some data - not
> 700g,
> > > though :) - I'll try to do a bit of profiling during the next days as
> my
> > > day job permits, but I do not have any machines with SSDs).
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Bryan Keller <br...@gmail.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Tuesday, April 30, 2013 9:31 PM
> > > > Subject: Re: Poor HBase map-reduce scan performance
> > > >
> > > >
> > > > Yes, I have tried various settings for setCaching() and I have
> > > setCacheBlocks(false)
> > > >
> > > > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > >> From http://hbase.apache.org/book.html#mapreduce.example :
> > > >>
> > > >> scan.setCaching(500);        // 1 is the default in Scan, which will
> > > >> be bad for MapReduce jobs
> > > >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> > > >>
> > > >> I guess you have used the above setting.
> > > >>
> > > >> 0.94.x releases are compatible. Have you considered upgrading to,
> say
> > > >> 0.94.7 which was recently released ?
> > > >>
> > > >> Cheers
> > > >>
> > > >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com>
> > > wrote:
> > > >>
> > > >>> I have been attempting to speed up my HBase map-reduce scans for a
> > > while
> > > >>> now. I have tried just about everything without much luck. I'm
> > running
> > > out
> > > >>> of ideas and was hoping for some suggestions. This is HBase 0.94.2
> > and
> > > >>> Hadoop 2.0.0 (CDH4.2.1).
> > > >>>
> > > >>> The table I'm scanning:
> > > >>> 20 mil rows
> > > >>> Hundreds of columns/row
> > > >>> Column keys can be 30-40 bytes
> > > >>> Column values are generally not large, 1k would be on the large
> side
> > > >>> 250 regions
> > > >>> Snappy compression
> > > >>> 8gb region size
> > > >>> 512mb memstore flush
> > > >>> 128k block size
> > > >>> 700gb of data on HDFS
> > > >>>
> > > >>> My cluster has 8 datanodes which are also regionservers. Each has 8
> > > cores
> > > >>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a
> > separate
> > > >>> machine acting as namenode, HMaster, and zookeeper (single
> > instance). I
> > > >>> have disk local reads turned on.
> > > >>>
> > > >>> I'm seeing around 5 gbit/sec on average network IO. Each disk is
> > > getting
> > > >>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 =
> > > 6.4gb/sec.
> > > >>>
> > > >>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read
> > speed.
> > > Not
> > > >>> really that great compared to the theoretical I/O. However this is
> > far
> > > >>> better than I am seeing with HBase map-reduce scans of my table.
> > > >>>
> > > >>> I have a simple no-op map-only job (using TableInputFormat) that
> > scans
> > > the
> > > >>> table and does nothing with data. This takes 45 minutes. That's
> about
> > > >>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
> > > >>> Basically, with HBase I'm seeing read performance of my 16 SSD
> > cluster
> > > >>> performing nearly 35% slower than a single SSD.
> > > >>>
> > > >>> Here are some things I have changed to no avail:
> > > >>> Scan caching values
> > > >>> HDFS block sizes
> > > >>> HBase block sizes
> > > >>> Region file sizes
> > > >>> Memory settings
> > > >>> GC settings
> > > >>> Number of mappers/node
> > > >>> Compressed vs not compressed
> > > >>>
> > > >>> One thing I notice is that the regionserver is using quite a bit of
> > CPU
> > > >>> during the map reduce job. When dumping the jstack of the process,
> it
> > > seems
> > > >>> like it is usually in some type of memory allocation or
> decompression
> > > >>> routine which didn't seem abnormal.
> > > >>>
> > > >>> I can't seem to pinpoint the bottleneck. CPU use by the
> regionserver
> > is
> > > >>> high but not maxed out. Disk I/O and network I/O are low, IO wait
> is
> > > low.
> > > >>> I'm on the verge of just writing the dataset out to sequence files
> > > once a
> > > >>> day for scan purposes. Is that what others are doing?
> > >
> >
>

Re: Poor HBase map-reduce scan performance

Posted by Naidu MS <sa...@gmail.com>.
Hi, I have two questions regarding HDFS and the jps utility.

I am new to Hadoop and started learning it this past week.

1. Whenever I run start-all.sh and then jps in the console, it shows the
processes that were started:

naidu@naidu:~/work/hadoop-1.0.4/bin$ jps
22283 NameNode
23516 TaskTracker
26711 Jps
22541 DataNode
23255 JobTracker
22813 SecondaryNameNode
Could not synchronize with target

But along with the list of processes that were started, the jps output always
includes "Could not synchronize with target". What does "Could not synchronize
with target" mean? Can someone explain why this is happening?


2. Is it possible to format the namenode multiple times? When I enter the
namenode -format command, it does not format the namenode and shows the
following output.

naidu@naidu:~/work/hadoop-1.0.4/bin$ hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

13/05/01 12:08:04 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = naidu/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.4
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012
************************************************************/
Re-format filesystem in /home/naidu/dfs/namenode ? (Y or N) y
Format aborted in /home/naidu/dfs/namenode
13/05/01 12:08:05 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at naidu/127.0.0.1
************************************************************/

Can someone help me understand this? Why is it not possible to format the
namenode multiple times?
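
For what it's worth, the output above suggests the prompt answer is the problem: in Hadoop 1.0.x the re-format confirmation accepts only an uppercase 'Y', so the lowercase 'y' shown above aborts the format. A minimal sketch of a re-format, assuming the paths from the output above and that losing the existing HDFS metadata is acceptable:

stop-all.sh                  # stop the daemons so the name directory is not in use
hadoop namenode -format      # answer the "(Y or N)" prompt with an uppercase Y
start-all.sh

If the DataNodes later complain about a namespaceID mismatch after a re-format, their dfs.data.dir contents typically have to be cleared as well.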


On Wed, May 1, 2013 at 12:22 PM, Matt Corgan <mc...@hotpads.com> wrote:

> Not that it's a long-term solution, but try major-compacting before running
> the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
> through the PriorityQueue, then reducing to a single file per region should
> help.  The merging of HFiles during a scan is not heavily optimized yet.
>
>
> On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <la...@apache.org> wrote:
>
> > If you can, try 0.94.4+; it should significantly reduce the amount of
> > bytes copied around in RAM during scanning, especially if you have wide
> > rows and/or large key portions. That in turns makes scans scale better
> > across cores, since RAM is shared resource between cores (much like
> disk).
> >
> >
> > It's not hard to build the latest HBase against Cloudera's version of
> > Hadoop. I can send along a simple patch to pom.xml to do that.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, April 30, 2013 11:02 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > The table has hashed keys so rows are evenly distributed amongst the
> > regionservers, and load on each regionserver is pretty much the same. I
> > also have per-table balancing turned on. I get mostly data local mappers
> > with only a few rack local (maybe 10 of the 250 mappers).
> >
> > Currently the table is a wide table schema, with lists of data structures
> > stored as columns with column prefixes grouping the data structures (e.g.
> > 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of
> > moving those data structures to protobuf which would cut down on the
> number
> > of columns. The downside is I can't filter on one value with that, but it
> > is a tradeoff I would make for performance. I was also considering
> > restructuring the table into a tall table.
> >
> > Something interesting is that my old regionserver machines had five 15k
> > SCSI drives instead of 2 SSDs, and performance was about the same. Also,
> my
> > old network was 1gbit, now it is 10gbit. So neither network nor disk I/O
> > appear to be the bottleneck. The CPU is rather high for the regionserver
> so
> > it seems like the best candidate to investigate. I will try profiling it
> > tomorrow and will report back. I may revisit compression on vs off since
> > that is adding load to the CPU.
> >
> > I'll also come up with a sample program that generates data similar to my
> > table.
> >
> >
> > On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > Your average row is 35k so scanner caching would not make a huge
> > difference, although I would have expected some improvements by setting
> it
> > to 10 or 50 since you have a wide 10ge pipe.
> > >
> > > I assume your table is split sufficiently to touch all RegionServer...
> > Do you see the same load/IO on all region servers?
> > >
> > > A bunch of scan improvements went into HBase since 0.94.2.
> > > I blogged about some of these changes here:
> > http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> > >
> > > In your case - since you have many columns, each of which carry the
> > rowkey - you might benefit a lot from HBASE-7279.
> > >
> > > In the end HBase *is* slower than straight HDFS for full scans. How
> > could it not be?
> > > So I would start by looking at HDFS first. Make sure Nagle's is
> disbaled
> > in both HBase and HDFS.
> > >
> > > And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> Purtell
> > is listening, I think he did some tests with HBase on SSDs.
> > > With rotating media you typically see an improvement with compression.
> > With SSDs the added CPU needed for decompression might outweigh the
> > benefits.
> > >
> > > At the risk of starting a larger discussion here, I would posit that
> > HBase's LSM based design, which trades random IO with sequential IO,
> might
> > be a bit more questionable on SSDs.
> > >
> > > If you can, it would be nice to run a profiler against one of the
> > RegionServers (or maybe do it with the single RS setup) and see where it
> is
> > bottlenecked.
> > > (And if you send me a sample program to generate some data - not 700g,
> > though :) - I'll try to do a bit of profiling during the next days as my
> > day job permits, but I do not have any machines with SSDs).
> > >
> > > -- Lars
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Bryan Keller <br...@gmail.com>
> > > To: user@hbase.apache.org
> > > Sent: Tuesday, April 30, 2013 9:31 PM
> > > Subject: Re: Poor HBase map-reduce scan performance
> > >
> > >
> > > Yes, I have tried various settings for setCaching() and I have
> > setCacheBlocks(false)
> > >
> > > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> From http://hbase.apache.org/book.html#mapreduce.example :
> > >>
> > >> scan.setCaching(500);        // 1 is the default in Scan, which will
> > >> be bad for MapReduce jobs
> > >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> > >>
> > >> I guess you have used the above setting.
> > >>
> > >> 0.94.x releases are compatible. Have you considered upgrading to, say
> > >> 0.94.7 which was recently released ?
> > >>
> > >> Cheers
> > >>
> > >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com>
> > wrote:
> > >>
> > >>> I have been attempting to speed up my HBase map-reduce scans for a
> > while
> > >>> now. I have tried just about everything without much luck. I'm
> running
> > out
> > >>> of ideas and was hoping for some suggestions. This is HBase 0.94.2
> and
> > >>> Hadoop 2.0.0 (CDH4.2.1).
> > >>>
> > >>> The table I'm scanning:
> > >>> 20 mil rows
> > >>> Hundreds of columns/row
> > >>> Column keys can be 30-40 bytes
> > >>> Column values are generally not large, 1k would be on the large side
> > >>> 250 regions
> > >>> Snappy compression
> > >>> 8gb region size
> > >>> 512mb memstore flush
> > >>> 128k block size
> > >>> 700gb of data on HDFS
> > >>>
> > >>> My cluster has 8 datanodes which are also regionservers. Each has 8
> > cores
> > >>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a
> separate
> > >>> machine acting as namenode, HMaster, and zookeeper (single
> instance). I
> > >>> have disk local reads turned on.
> > >>>
> > >>> I'm seeing around 5 gbit/sec on average network IO. Each disk is
> > getting
> > >>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 =
> > 6.4gb/sec.
> > >>>
> > >>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read
> speed.
> > Not
> > >>> really that great compared to the theoretical I/O. However this is
> far
> > >>> better than I am seeing with HBase map-reduce scans of my table.
> > >>>
> > >>> I have a simple no-op map-only job (using TableInputFormat) that
> scans
> > the
> > >>> table and does nothing with data. This takes 45 minutes. That's about
> > >>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
> > >>> Basically, with HBase I'm seeing read performance of my 16 SSD
> cluster
> > >>> performing nearly 35% slower than a single SSD.
> > >>>
> > >>> Here are some things I have changed to no avail:
> > >>> Scan caching values
> > >>> HDFS block sizes
> > >>> HBase block sizes
> > >>> Region file sizes
> > >>> Memory settings
> > >>> GC settings
> > >>> Number of mappers/node
> > >>> Compressed vs not compressed
> > >>>
> > >>> One thing I notice is that the regionserver is using quite a bit of
> CPU
> > >>> during the map reduce job. When dumping the jstack of the process, it
> > seems
> > >>> like it is usually in some type of memory allocation or decompression
> > >>> routine which didn't seem abnormal.
> > >>>
> > >>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver
> is
> > >>> high but not maxed out. Disk I/O and network I/O are low, IO wait is
> > low.
> > >>> I'm on the verge of just writing the dataset out to sequence files
> > once a
> > >>> day for scan purposes. Is that what others are doing?
> >
>

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Yes, I have monitored GC, CPU, disk and network IO, and anything else I could think of. Only the CPU usage by the regionserver is on the high side.

I mentioned that data-local mappers make up generally 240 of the 250 mappers (96%); I get this information from the jobtracker. Does the JMX console give more accurate information?
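
The jobtracker counters and HBase's own locality metric measure slightly different things; the latter is exposed per regionserver as hdfsBlocksLocalityIndex. A quick way to read it, assuming the default regionserver info port (60030) and that the /jmx servlet is available (both assumptions, and rs-host is a placeholder):

curl -s http://rs-host:60030/jmx | grep -i hdfsBlocksLocalityIndex

The same value can usually be seen in the regionserver web UI metrics dump or via JConsole.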

On May 1, 2013, at 3:56 AM, Jean-Marc Spaggiari <je...@spaggiari.org> wrote:

> @Lars, how have your calculated the 35K/row size? I'm not able to find the
> same number.
> 
> @Bryan, Matt's idea below is good. With the hadoop test you always had data
> locality. Which your HBase test, maybe not. Can you take a look at the JMX
> console and tell us your locality % ? Also, over those 45 minutes, have you
> monitored the CPWIO, GC activities, etc. to see if any of those might have
> impacted the performances?
> 
> JM
> 
> 2013/5/1 Matt Corgan <mc...@hotpads.com>
> 
>> Not that it's a long-term solution, but try major-compacting before running
>> the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
>> through the PriorityQueue, then reducing to a single file per region should
>> help.  The merging of HFiles during a scan is not heavily optimized yet.
>> 
>> 
>> On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <la...@apache.org> wrote:
>> 
>>> If you can, try 0.94.4+; it should significantly reduce the amount of
>>> bytes copied around in RAM during scanning, especially if you have wide
>>> rows and/or large key portions. That in turns makes scans scale better
>>> across cores, since RAM is shared resource between cores (much like
>> disk).
>>> 
>>> 
>>> It's not hard to build the latest HBase against Cloudera's version of
>>> Hadoop. I can send along a simple patch to pom.xml to do that.
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Bryan Keller <br...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Tuesday, April 30, 2013 11:02 PM
>>> Subject: Re: Poor HBase map-reduce scan performance
>>> 
>>> 
>>> The table has hashed keys so rows are evenly distributed amongst the
>>> regionservers, and load on each regionserver is pretty much the same. I
>>> also have per-table balancing turned on. I get mostly data local mappers
>>> with only a few rack local (maybe 10 of the 250 mappers).
>>> 
>>> Currently the table is a wide table schema, with lists of data structures
>>> stored as columns with column prefixes grouping the data structures (e.g.
>>> 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of
>>> moving those data structures to protobuf which would cut down on the
>> number
>>> of columns. The downside is I can't filter on one value with that, but it
>>> is a tradeoff I would make for performance. I was also considering
>>> restructuring the table into a tall table.
>>> 
>>> Something interesting is that my old regionserver machines had five 15k
>>> SCSI drives instead of 2 SSDs, and performance was about the same. Also,
>> my
>>> old network was 1gbit, now it is 10gbit. So neither network nor disk I/O
>>> appear to be the bottleneck. The CPU is rather high for the regionserver
>> so
>>> it seems like the best candidate to investigate. I will try profiling it
>>> tomorrow and will report back. I may revisit compression on vs off since
>>> that is adding load to the CPU.
>>> 
>>> I'll also come up with a sample program that generates data similar to my
>>> table.
>>> 
>>> 
>>> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> Your average row is 35k so scanner caching would not make a huge
>>> difference, although I would have expected some improvements by setting
>> it
>>> to 10 or 50 since you have a wide 10ge pipe.
>>>> 
>>>> I assume your table is split sufficiently to touch all RegionServer...
>>> Do you see the same load/IO on all region servers?
>>>> 
>>>> A bunch of scan improvements went into HBase since 0.94.2.
>>>> I blogged about some of these changes here:
>>> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
>>>> 
>>>> In your case - since you have many columns, each of which carry the
>>> rowkey - you might benefit a lot from HBASE-7279.
>>>> 
>>>> In the end HBase *is* slower than straight HDFS for full scans. How
>>> could it not be?
>>>> So I would start by looking at HDFS first. Make sure Nagle's is
>> disbaled
>>> in both HBase and HDFS.
>>>> 
>>>> And lastly SSDs are somewhat new territory for HBase. Maybe Andy
>> Purtell
>>> is listening, I think he did some tests with HBase on SSDs.
>>>> With rotating media you typically see an improvement with compression.
>>> With SSDs the added CPU needed for decompression might outweigh the
>>> benefits.
>>>> 
>>>> At the risk of starting a larger discussion here, I would posit that
>>> HBase's LSM based design, which trades random IO with sequential IO,
>> might
>>> be a bit more questionable on SSDs.
>>>> 
>>>> If you can, it would be nice to run a profiler against one of the
>>> RegionServers (or maybe do it with the single RS setup) and see where it
>> is
>>> bottlenecked.
>>>> (And if you send me a sample program to generate some data - not 700g,
>>> though :) - I'll try to do a bit of profiling during the next days as my
>>> day job permits, but I do not have any machines with SSDs).
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Bryan Keller <br...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Tuesday, April 30, 2013 9:31 PM
>>>> Subject: Re: Poor HBase map-reduce scan performance
>>>> 
>>>> 
>>>> Yes, I have tried various settings for setCaching() and I have
>>> setCacheBlocks(false)
>>>> 
>>>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> 
>>>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>>>> 
>>>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>>>> be bad for MapReduce jobs
>>>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>>>> 
>>>>> I guess you have used the above setting.
>>>>> 
>>>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>>>> 0.94.7 which was recently released ?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> I have been attempting to speed up my HBase map-reduce scans for a
>>> while
>>>>>> now. I have tried just about everything without much luck. I'm
>> running
>>> out
>>>>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2
>> and
>>>>>> Hadoop 2.0.0 (CDH4.2.1).
>>>>>> 
>>>>>> The table I'm scanning:
>>>>>> 20 mil rows
>>>>>> Hundreds of columns/row
>>>>>> Column keys can be 30-40 bytes
>>>>>> Column values are generally not large, 1k would be on the large side
>>>>>> 250 regions
>>>>>> Snappy compression
>>>>>> 8gb region size
>>>>>> 512mb memstore flush
>>>>>> 128k block size
>>>>>> 700gb of data on HDFS
>>>>>> 
>>>>>> My cluster has 8 datanodes which are also regionservers. Each has 8
>>> cores
>>>>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a
>> separate
>>>>>> machine acting as namenode, HMaster, and zookeeper (single
>> instance). I
>>>>>> have disk local reads turned on.
>>>>>> 
>>>>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is
>>> getting
>>>>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 =
>>> 6.4gb/sec.
>>>>>> 
>>>>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read
>> speed.
>>> Not
>>>>>> really that great compared to the theoretical I/O. However this is
>> far
>>>>>> better than I am seeing with HBase map-reduce scans of my table.
>>>>>> 
>>>>>> I have a simple no-op map-only job (using TableInputFormat) that
>> scans
>>> the
>>>>>> table and does nothing with data. This takes 45 minutes. That's about
>>>>>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>>>>>> Basically, with HBase I'm seeing read performance of my 16 SSD
>> cluster
>>>>>> performing nearly 35% slower than a single SSD.
>>>>>> 
>>>>>> Here are some things I have changed to no avail:
>>>>>> Scan caching values
>>>>>> HDFS block sizes
>>>>>> HBase block sizes
>>>>>> Region file sizes
>>>>>> Memory settings
>>>>>> GC settings
>>>>>> Number of mappers/node
>>>>>> Compressed vs not compressed
>>>>>> 
>>>>>> One thing I notice is that the regionserver is using quite a bit of
>> CPU
>>>>>> during the map reduce job. When dumping the jstack of the process, it
>>> seems
>>>>>> like it is usually in some type of memory allocation or decompression
>>>>>> routine which didn't seem abnormal.
>>>>>> 
>>>>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver
>> is
>>>>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is
>>> low.
>>>>>> I'm on the verge of just writing the dataset out to sequence files
>>> once a
>>>>>> day for scan purposes. Is that what others are doing?
>>> 
>> 


Re: Poor HBase map-reduce scan performance

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
@Lars, how have you calculated the 35K/row size? I'm not able to find the
same number.

@Bryan, Matt's idea below is good. With the Hadoop test you always had data
locality; with your HBase test, maybe not. Can you take a look at the JMX
console and tell us your locality %? Also, over those 45 minutes, have you
monitored the CPU I/O wait, GC activity, etc. to see if any of those might
have impacted the performance?

JM

2013/5/1 Matt Corgan <mc...@hotpads.com>

> Not that it's a long-term solution, but try major-compacting before running
> the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
> through the PriorityQueue, then reducing to a single file per region should
> help.  The merging of HFiles during a scan is not heavily optimized yet.
>
>
> On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <la...@apache.org> wrote:
>
> > If you can, try 0.94.4+; it should significantly reduce the amount of
> > bytes copied around in RAM during scanning, especially if you have wide
> > rows and/or large key portions. That in turns makes scans scale better
> > across cores, since RAM is shared resource between cores (much like
> disk).
> >
> >
> > It's not hard to build the latest HBase against Cloudera's version of
> > Hadoop. I can send along a simple patch to pom.xml to do that.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, April 30, 2013 11:02 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > The table has hashed keys so rows are evenly distributed amongst the
> > regionservers, and load on each regionserver is pretty much the same. I
> > also have per-table balancing turned on. I get mostly data local mappers
> > with only a few rack local (maybe 10 of the 250 mappers).
> >
> > Currently the table is a wide table schema, with lists of data structures
> > stored as columns with column prefixes grouping the data structures (e.g.
> > 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of
> > moving those data structures to protobuf which would cut down on the
> number
> > of columns. The downside is I can't filter on one value with that, but it
> > is a tradeoff I would make for performance. I was also considering
> > restructuring the table into a tall table.
> >
> > Something interesting is that my old regionserver machines had five 15k
> > SCSI drives instead of 2 SSDs, and performance was about the same. Also,
> my
> > old network was 1gbit, now it is 10gbit. So neither network nor disk I/O
> > appear to be the bottleneck. The CPU is rather high for the regionserver
> so
> > it seems like the best candidate to investigate. I will try profiling it
> > tomorrow and will report back. I may revisit compression on vs off since
> > that is adding load to the CPU.
> >
> > I'll also come up with a sample program that generates data similar to my
> > table.
> >
> >
> > On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > Your average row is 35k so scanner caching would not make a huge
> > difference, although I would have expected some improvements by setting
> it
> > to 10 or 50 since you have a wide 10ge pipe.
> > >
> > > I assume your table is split sufficiently to touch all RegionServer...
> > Do you see the same load/IO on all region servers?
> > >
> > > A bunch of scan improvements went into HBase since 0.94.2.
> > > I blogged about some of these changes here:
> > http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> > >
> > > In your case - since you have many columns, each of which carry the
> > rowkey - you might benefit a lot from HBASE-7279.
> > >
> > > In the end HBase *is* slower than straight HDFS for full scans. How
> > could it not be?
> > > So I would start by looking at HDFS first. Make sure Nagle's is
> disbaled
> > in both HBase and HDFS.
> > >
> > > And lastly SSDs are somewhat new territory for HBase. Maybe Andy
> Purtell
> > is listening, I think he did some tests with HBase on SSDs.
> > > With rotating media you typically see an improvement with compression.
> > With SSDs the added CPU needed for decompression might outweigh the
> > benefits.
> > >
> > > At the risk of starting a larger discussion here, I would posit that
> > HBase's LSM based design, which trades random IO with sequential IO,
> might
> > be a bit more questionable on SSDs.
> > >
> > > If you can, it would be nice to run a profiler against one of the
> > RegionServers (or maybe do it with the single RS setup) and see where it
> is
> > bottlenecked.
> > > (And if you send me a sample program to generate some data - not 700g,
> > though :) - I'll try to do a bit of profiling during the next days as my
> > day job permits, but I do not have any machines with SSDs).
> > >
> > > -- Lars
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Bryan Keller <br...@gmail.com>
> > > To: user@hbase.apache.org
> > > Sent: Tuesday, April 30, 2013 9:31 PM
> > > Subject: Re: Poor HBase map-reduce scan performance
> > >
> > >
> > > Yes, I have tried various settings for setCaching() and I have
> > setCacheBlocks(false)
> > >
> > > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> From http://hbase.apache.org/book.html#mapreduce.example :
> > >>
> > >> scan.setCaching(500);        // 1 is the default in Scan, which will
> > >> be bad for MapReduce jobs
> > >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> > >>
> > >> I guess you have used the above setting.
> > >>
> > >> 0.94.x releases are compatible. Have you considered upgrading to, say
> > >> 0.94.7 which was recently released ?
> > >>
> > >> Cheers
> > >>
> > >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com>
> > wrote:
> > >>
> > >>> I have been attempting to speed up my HBase map-reduce scans for a
> > while
> > >>> now. I have tried just about everything without much luck. I'm
> running
> > out
> > >>> of ideas and was hoping for some suggestions. This is HBase 0.94.2
> and
> > >>> Hadoop 2.0.0 (CDH4.2.1).
> > >>>
> > >>> The table I'm scanning:
> > >>> 20 mil rows
> > >>> Hundreds of columns/row
> > >>> Column keys can be 30-40 bytes
> > >>> Column values are generally not large, 1k would be on the large side
> > >>> 250 regions
> > >>> Snappy compression
> > >>> 8gb region size
> > >>> 512mb memstore flush
> > >>> 128k block size
> > >>> 700gb of data on HDFS
> > >>>
> > >>> My cluster has 8 datanodes which are also regionservers. Each has 8
> > cores
> > >>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a
> separate
> > >>> machine acting as namenode, HMaster, and zookeeper (single
> instance). I
> > >>> have disk local reads turned on.
> > >>>
> > >>> I'm seeing around 5 gbit/sec on average network IO. Each disk is
> > getting
> > >>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 =
> > 6.4gb/sec.
> > >>>
> > >>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read
> speed.
> > Not
> > >>> really that great compared to the theoretical I/O. However this is
> far
> > >>> better than I am seeing with HBase map-reduce scans of my table.
> > >>>
> > >>> I have a simple no-op map-only job (using TableInputFormat) that
> scans
> > the
> > >>> table and does nothing with data. This takes 45 minutes. That's about
> > >>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
> > >>> Basically, with HBase I'm seeing read performance of my 16 SSD
> cluster
> > >>> performing nearly 35% slower than a single SSD.
> > >>>
> > >>> Here are some things I have changed to no avail:
> > >>> Scan caching values
> > >>> HDFS block sizes
> > >>> HBase block sizes
> > >>> Region file sizes
> > >>> Memory settings
> > >>> GC settings
> > >>> Number of mappers/node
> > >>> Compressed vs not compressed
> > >>>
> > >>> One thing I notice is that the regionserver is using quite a bit of
> CPU
> > >>> during the map reduce job. When dumping the jstack of the process, it
> > seems
> > >>> like it is usually in some type of memory allocation or decompression
> > >>> routine which didn't seem abnormal.
> > >>>
> > >>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver
> is
> > >>> high but not maxed out. Disk I/O and network I/O are low, IO wait is
> > low.
> > >>> I'm on the verge of just writing the dataset out to sequence files
> > once a
> > >>> day for scan purposes. Is that what others are doing?
> >
>

Re: Poor HBase map-reduce scan performance

Posted by Matt Corgan <mc...@hotpads.com>.
Not that it's a long-term solution, but try major-compacting before running
the benchmark.  If the LSM tree is CPU bound in merging HFiles/KeyValues
through the PriorityQueue, then reducing to a single file per region should
help.  The merging of HFiles during a scan is not heavily optimized yet.
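
For reference, a major compaction can be kicked off from the HBase shell; it runs asynchronously, so watch the regionserver compaction queues before starting the benchmark. 'mytable' is a placeholder:

echo "major_compact 'mytable'" | hbase shell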


On Tue, Apr 30, 2013 at 11:21 PM, lars hofhansl <la...@apache.org> wrote:

> If you can, try 0.94.4+; it should significantly reduce the amount of
> bytes copied around in RAM during scanning, especially if you have wide
> rows and/or large key portions. That in turns makes scans scale better
> across cores, since RAM is shared resource between cores (much like disk).
>
>
> It's not hard to build the latest HBase against Cloudera's version of
> Hadoop. I can send along a simple patch to pom.xml to do that.
>
> -- Lars
>
>
>
> ________________________________
>  From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org
> Sent: Tuesday, April 30, 2013 11:02 PM
> Subject: Re: Poor HBase map-reduce scan performance
>
>
> The table has hashed keys so rows are evenly distributed amongst the
> regionservers, and load on each regionserver is pretty much the same. I
> also have per-table balancing turned on. I get mostly data local mappers
> with only a few rack local (maybe 10 of the 250 mappers).
>
> Currently the table is a wide table schema, with lists of data structures
> stored as columns with column prefixes grouping the data structures (e.g.
> 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of
> moving those data structures to protobuf which would cut down on the number
> of columns. The downside is I can't filter on one value with that, but it
> is a tradeoff I would make for performance. I was also considering
> restructuring the table into a tall table.
>
> Something interesting is that my old regionserver machines had five 15k
> SCSI drives instead of 2 SSDs, and performance was about the same. Also, my
> old network was 1gbit, now it is 10gbit. So neither network nor disk I/O
> appear to be the bottleneck. The CPU is rather high for the regionserver so
> it seems like the best candidate to investigate. I will try profiling it
> tomorrow and will report back. I may revisit compression on vs off since
> that is adding load to the CPU.
>
> I'll also come up with a sample program that generates data similar to my
> table.
>
>
> On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Your average row is 35k so scanner caching would not make a huge
> difference, although I would have expected some improvements by setting it
> to 10 or 50 since you have a wide 10ge pipe.
> >
> > I assume your table is split sufficiently to touch all RegionServer...
> Do you see the same load/IO on all region servers?
> >
> > A bunch of scan improvements went into HBase since 0.94.2.
> > I blogged about some of these changes here:
> http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> >
> > In your case - since you have many columns, each of which carry the
> rowkey - you might benefit a lot from HBASE-7279.
> >
> > In the end HBase *is* slower than straight HDFS for full scans. How
> could it not be?
> > So I would start by looking at HDFS first. Make sure Nagle's is disbaled
> in both HBase and HDFS.
> >
> > And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell
> is listening, I think he did some tests with HBase on SSDs.
> > With rotating media you typically see an improvement with compression.
> With SSDs the added CPU needed for decompression might outweigh the
> benefits.
> >
> > At the risk of starting a larger discussion here, I would posit that
> HBase's LSM based design, which trades random IO with sequential IO, might
> be a bit more questionable on SSDs.
> >
> > If you can, it would be nice to run a profiler against one of the
> RegionServers (or maybe do it with the single RS setup) and see where it is
> bottlenecked.
> > (And if you send me a sample program to generate some data - not 700g,
> though :) - I'll try to do a bit of profiling during the next days as my
> day job permits, but I do not have any machines with SSDs).
> >
> > -- Lars
> >
> >
> >
> >
> > ________________________________
> > From: Bryan Keller <br...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, April 30, 2013 9:31 PM
> > Subject: Re: Poor HBase map-reduce scan performance
> >
> >
> > Yes, I have tried various settings for setCaching() and I have
> setCacheBlocks(false)
> >
> > On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> From http://hbase.apache.org/book.html#mapreduce.example :
> >>
> >> scan.setCaching(500);        // 1 is the default in Scan, which will
> >> be bad for MapReduce jobs
> >> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> >>
> >> I guess you have used the above setting.
> >>
> >> 0.94.x releases are compatible. Have you considered upgrading to, say
> >> 0.94.7 which was recently released ?
> >>
> >> Cheers
> >>
> >> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com>
> wrote:
> >>
> >>> I have been attempting to speed up my HBase map-reduce scans for a
> while
> >>> now. I have tried just about everything without much luck. I'm running
> out
> >>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
> >>> Hadoop 2.0.0 (CDH4.2.1).
> >>>
> >>> The table I'm scanning:
> >>> 20 mil rows
> >>> Hundreds of columns/row
> >>> Column keys can be 30-40 bytes
> >>> Column values are generally not large, 1k would be on the large side
> >>> 250 regions
> >>> Snappy compression
> >>> 8gb region size
> >>> 512mb memstore flush
> >>> 128k block size
> >>> 700gb of data on HDFS
> >>>
> >>> My cluster has 8 datanodes which are also regionservers. Each has 8
> cores
> >>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
> >>> machine acting as namenode, HMaster, and zookeeper (single instance). I
> >>> have disk local reads turned on.
> >>>
> >>> I'm seeing around 5 gbit/sec on average network IO. Each disk is
> getting
> >>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 =
> 6.4gb/sec.
> >>>
> >>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed.
> Not
> >>> really that great compared to the theoretical I/O. However this is far
> >>> better than I am seeing with HBase map-reduce scans of my table.
> >>>
> >>> I have a simple no-op map-only job (using TableInputFormat) that scans
> the
> >>> table and does nothing with data. This takes 45 minutes. That's about
> >>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
> >>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster
> >>> performing nearly 35% slower than a single SSD.
> >>>
> >>> Here are some things I have changed to no avail:
> >>> Scan caching values
> >>> HDFS block sizes
> >>> HBase block sizes
> >>> Region file sizes
> >>> Memory settings
> >>> GC settings
> >>> Number of mappers/node
> >>> Compressed vs not compressed
> >>>
> >>> One thing I notice is that the regionserver is using quite a bit of CPU
> >>> during the map reduce job. When dumping the jstack of the process, it
> seems
> >>> like it is usually in some type of memory allocation or decompression
> >>> routine which didn't seem abnormal.
> >>>
> >>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
> >>> high but not maxed out. Disk I/O and network I/O are low, IO wait is
> low.
> >>> I'm on the verge of just writing the dataset out to sequence files
> once a
> >>> day for scan purposes. Is that what others are doing?
>

Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
If you can, try 0.94.4+; it should significantly reduce the amount of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turn makes scans scale better across cores, since RAM is a shared resource between cores (much like disk).


It's not hard to build the latest HBase against Cloudera's version of Hadoop. I can send along a simple patch to pom.xml to do that.

-- Lars



________________________________
 From: Bryan Keller <br...@gmail.com>
To: user@hbase.apache.org 
Sent: Tuesday, April 30, 2013 11:02 PM
Subject: Re: Poor HBase map-reduce scan performance
 

The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).

Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.

Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.

I'll also come up with a sample program that generates data similar to my table.


On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:

> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
> 
> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
> 
> A bunch of scan improvements went into HBase since 0.94.2.
> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> 
> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
> 
> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
> 
> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
> 
> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
> 
> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
> 
> -- Lars
> 
> 
> 
> 
> ________________________________
> From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org 
> Sent: Tuesday, April 30, 2013 9:31 PM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
> 
> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> From http://hbase.apache.org/book.html#mapreduce.example :
>> 
>> scan.setCaching(500);        // 1 is the default in Scan, which will
>> be bad for MapReduce jobs
>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>> 
>> I guess you have used the above setting.
>> 
>> 0.94.x releases are compatible. Have you considered upgrading to, say
>> 0.94.7 which was recently released ?
>> 
>> Cheers
>> 
>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I have been attempting to speed up my HBase map-reduce scans for a while
>>> now. I have tried just about everything without much luck. I'm running out
>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>>> Hadoop 2.0.0 (CDH4.2.1).
>>> 
>>> The table I'm scanning:
>>> 20 mil rows
>>> Hundreds of columns/row
>>> Column keys can be 30-40 bytes
>>> Column values are generally not large, 1k would be on the large side
>>> 250 regions
>>> Snappy compression
>>> 8gb region size
>>> 512mb memstore flush
>>> 128k block size
>>> 700gb of data on HDFS
>>> 
>>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>>> have disk local reads turned on.
>>> 
>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>>> 
>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>>> really that great compared to the theoretical I/O. However this is far
>>> better than I am seeing with HBase map-reduce scans of my table.
>>> 
>>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>>> table and does nothing with data. This takes 45 minutes. That's about
>>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster
>>> performing nearly 35% slower than a single SSD.
>>> 
>>> Here are some things I have changed to no avail:
>>> Scan caching values
>>> HDFS block sizes
>>> HBase block sizes
>>> Region file sizes
>>> Memory settings
>>> GC settings
>>> Number of mappers/node
>>> Compressed vs not compressed
>>> 
>>> One thing I notice is that the regionserver is using quite a bit of CPU
>>> during the map reduce job. When dumping the jstack of the process, it seems
>>> like it is usually in some type of memory allocation or decompression
>>> routine which didn't seem abnormal.
>>> 
>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
>>> I'm on the verge of just writing the dataset out to sequence files once a
>>> day for scan purposes. Is that what others are doing?

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers).

Currently the table is a wide table schema, with lists of data structures stored as columns with column prefixes grouping the data structures (e.g. 1_name, 1_address, 1_city, 2_name, 2_address, 2_city). I was thinking of moving those data structures to protobuf which would cut down on the number of columns. The downside is I can't filter on one value with that, but it is a tradeoff I would make for performance. I was also considering restructuring the table into a tall table.
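
As an illustration of the packing idea, here is a hypothetical Java sketch against the 0.94 client API. The column family "d", the per-record qualifier scheme, and the delimiter-based serialization are all stand-ins; a real version would use protobuf- or Avro-generated classes:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PackedColumnSketch {

    /**
     * Builds a Put that stores one nested record as a single serialized cell
     * under qualifier "1", "2", ... instead of prefixed columns like
     * 1_name / 1_address / 1_city. Fewer KeyValues means the 30-40 byte
     * row key and qualifier overhead is carried once per record.
     */
    public static Put packedPut(byte[] row, int recordIndex,
                                String name, String address, String city) {
        byte[] family = Bytes.toBytes("d");  // assumed column family name

        // Stand-in serialization: a delimited string keeps the sketch
        // self-contained; swap in a protobuf/Avro-encoded byte[] in practice.
        byte[] serialized = Bytes.toBytes(name + '\u0001' + address + '\u0001' + city);

        Put put = new Put(row);
        put.add(family, Bytes.toBytes(Integer.toString(recordIndex)), serialized);
        return put;
    }
}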

Something interesting is that my old regionserver machines had five 15k SCSI drives instead of 2 SSDs, and performance was about the same. Also, my old network was 1gbit, now it is 10gbit. So neither network nor disk I/O appear to be the bottleneck. The CPU is rather high for the regionserver so it seems like the best candidate to investigate. I will try profiling it tomorrow and will report back. I may revisit compression on vs off since that is adding load to the CPU.

I'll also come up with a sample program that generates data similar to my table.


On Apr 30, 2013, at 10:01 PM, lars hofhansl <la...@apache.org> wrote:

> Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.
> 
> I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all region servers?
> 
> A bunch of scan improvements went into HBase since 0.94.2.
> I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html
> 
> In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.
> 
> In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
> So I would start by looking at HDFS first. Make sure Nagle's is disbaled in both HBase and HDFS.
> 
> And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
> With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.
> 
> At the risk of starting a larger discussion here, I would posit that HBase's LSM based design, which trades random IO with sequential IO, might be a bit more questionable on SSDs.
> 
> If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
> (And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).
> 
> -- Lars
> 
> 
> 
> 
> ________________________________
> From: Bryan Keller <br...@gmail.com>
> To: user@hbase.apache.org 
> Sent: Tuesday, April 30, 2013 9:31 PM
> Subject: Re: Poor HBase map-reduce scan performance
> 
> 
> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
> 
> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> From http://hbase.apache.org/book.html#mapreduce.example :
>> 
>> scan.setCaching(500);        // 1 is the default in Scan, which will
>> be bad for MapReduce jobs
>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>> 
>> I guess you have used the above setting.
>> 
>> 0.94.x releases are compatible. Have you considered upgrading to, say
>> 0.94.7 which was recently released ?
>> 
>> Cheers
>> 
>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com> wrote:
>> 
>>> I have been attempting to speed up my HBase map-reduce scans for a while
>>> now. I have tried just about everything without much luck. I'm running out
>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>>> Hadoop 2.0.0 (CDH4.2.1).
>>> 
>>> The table I'm scanning:
>>> 20 mil rows
>>> Hundreds of columns/row
>>> Column keys can be 30-40 bytes
>>> Column values are generally not large, 1k would be on the large side
>>> 250 regions
>>> Snappy compression
>>> 8gb region size
>>> 512mb memstore flush
>>> 128k block size
>>> 700gb of data on HDFS
>>> 
>>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>>> have disk local reads turned on.
>>> 
>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>>> 
>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>>> really that great compared to the theoretical I/O. However this is far
>>> better than I am seeing with HBase map-reduce scans of my table.
>>> 
>>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>>> table and does nothing with data. This takes 45 minutes. That's about
>>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster
>>> performing nearly 35% slower than a single SSD.
>>> 
>>> Here are some things I have changed to no avail:
>>> Scan caching values
>>> HDFS block sizes
>>> HBase block sizes
>>> Region file sizes
>>> Memory settings
>>> GC settings
>>> Number of mappers/node
>>> Compressed vs not compressed
>>> 
>>> One thing I notice is that the regionserver is using quite a bit of CPU
>>> during the map reduce job. When dumping the jstack of the process, it seems
>>> like it is usually in some type of memory allocation or decompression
>>> routine which didn't seem abnormal.
>>> 
>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
>>> I'm on the verge of just writing the dataset out to sequence files once a
>>> day for scan purposes. Is that what others are doing?


Re: Poor HBase map-reduce scan performance

Posted by lars hofhansl <la...@apache.org>.
Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe.

I assume your table is split sufficiently to touch all RegionServers... Do you see the same load/IO on all region servers?

A bunch of scan improvements went into HBase since 0.94.2.
I blogged about some of these changes here: http://hadoop-hbase.blogspot.com/2012/12/hbase-profiling.html

In your case - since you have many columns, each of which carry the rowkey - you might benefit a lot from HBASE-7279.

In the end HBase *is* slower than straight HDFS for full scans. How could it not be?
So I would start by looking at HDFS first. Make sure Nagle's is disabled in both HBase and HDFS.
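
For reference, disabling Nagle's is usually done with the tcpnodelay IPC settings; a sketch for hbase-site.xml (with the analogous ipc.client.tcpnodelay / ipc.server.tcpnodelay in Hadoop's core-site.xml), property names assumed from the 0.94-era RPC code:

<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>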

And lastly SSDs are somewhat new territory for HBase. Maybe Andy Purtell is listening, I think he did some tests with HBase on SSDs.
With rotating media you typically see an improvement with compression. With SSDs the added CPU needed for decompression might outweigh the benefits.

At the risk of starting a larger discussion here, I would posit that HBase's LSM-based design, which trades random IO for sequential IO, might be a bit more questionable on SSDs.

If you can, it would be nice to run a profiler against one of the RegionServers (or maybe do it with the single RS setup) and see where it is bottlenecked.
(And if you send me a sample program to generate some data - not 700g, though :) - I'll try to do a bit of profiling during the next days as my day job permits, but I do not have any machines with SSDs).

-- Lars




________________________________
 From: Bryan Keller <br...@gmail.com>
To: user@hbase.apache.org 
Sent: Tuesday, April 30, 2013 9:31 PM
Subject: Re: Poor HBase map-reduce scan performance
 

Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)

On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:

> From http://hbase.apache.org/book.html#mapreduce.example :
> 
> scan.setCaching(500);        // 1 is the default in Scan, which will
> be bad for MapReduce jobs
> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> 
> I guess you have used the above setting.
> 
> 0.94.x releases are compatible. Have you considered upgrading to, say
> 0.94.7 which was recently released ?
> 
> Cheers
> 
> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com> wrote:
> 
>> I have been attempting to speed up my HBase map-reduce scans for a while
>> now. I have tried just about everything without much luck. I'm running out
>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>> Hadoop 2.0.0 (CDH4.2.1).
>> 
>> The table I'm scanning:
>> 20 mil rows
>> Hundreds of columns/row
>> Column keys can be 30-40 bytes
>> Column values are generally not large, 1k would be on the large side
>> 250 regions
>> Snappy compression
>> 8gb region size
>> 512mb memstore flush
>> 128k block size
>> 700gb of data on HDFS
>> 
>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>> have disk local reads turned on.
>> 
>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>> 
>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>> really that great compared to the theoretical I/O. However this is far
>> better than I am seeing with HBase map-reduce scans of my table.
>> 
>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>> table and does nothing with data. This takes 45 minutes. That's about
>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster
>> performing nearly 35% slower than a single SSD.
>> 
>> Here are some things I have changed to no avail:
>> Scan caching values
>> HDFS block sizes
>> HBase block sizes
>> Region file sizes
>> Memory settings
>> GC settings
>> Number of mappers/node
>> Compressed vs not compressed
>> 
>> One thing I notice is that the regionserver is using quite a bit of CPU
>> during the map reduce job. When dumping the jstack of the process, it seems
>> like it is usually in some type of memory allocation or decompression
>> routine which didn't seem abnormal.
>> 
>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
>> I'm on the verge of just writing the dataset out to sequence files once a
>> day for scan purposes. Is that what others are doing?

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Yes, I have it enabled (forgot to mention that).
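
For readers following along, "enabled" here typically means something like the following in hdfs-site.xml (and mirrored into hbase-site.xml so the regionserver client side picks it up). The socket path is a placeholder, and the property names assume the HDFS-347-style implementation shipped in CDH 4.2:

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hdfs-sockets/dn</value>
  <!-- placeholder path; must exist and be accessible to the DataNode -->
</property>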

On Apr 30, 2013, at 9:56 PM, Ted Yu <yu...@gmail.com> wrote:

> Have you tried enabling short circuit read ?
> 
> Thanks
> 
> On Apr 30, 2013, at 9:31 PM, Bryan Keller <br...@gmail.com> wrote:
> 
>> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)
>> 
>> On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:
>> 
>>> From http://hbase.apache.org/book.html#mapreduce.example :
>>> 
>>> scan.setCaching(500);        // 1 is the default in Scan, which will
>>> be bad for MapReduce jobs
>>> scan.setCacheBlocks(false);  // don't set to true for MR jobs
>>> 
>>> I guess you have used the above setting.
>>> 
>>> 0.94.x releases are compatible. Have you considered upgrading to, say
>>> 0.94.7 which was recently released ?
>>> 
>>> Cheers
>>> 
>>> On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com> wrote:
>>> 
>>>> I have been attempting to speed up my HBase map-reduce scans for a while
>>>> now. I have tried just about everything without much luck. I'm running out
>>>> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
>>>> Hadoop 2.0.0 (CDH4.2.1).
>>>> 
>>>> The table I'm scanning:
>>>> 20 mil rows
>>>> Hundreds of columns/row
>>>> Column keys can be 30-40 bytes
>>>> Column values are generally not large, 1k would be on the large side
>>>> 250 regions
>>>> Snappy compression
>>>> 8gb region size
>>>> 512mb memstore flush
>>>> 128k block size
>>>> 700gb of data on HDFS
>>>> 
>>>> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
>>>> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
>>>> machine acting as namenode, HMaster, and zookeeper (single instance). I
>>>> have disk local reads turned on.
>>>> 
>>>> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
>>>> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>>>> 
>>>> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
>>>> really that great compared to the theoretical I/O. However this is far
>>>> better than I am seeing with HBase map-reduce scans of my table.
>>>> 
>>>> I have a simple no-op map-only job (using TableInputFormat) that scans the
>>>> table and does nothing with data. This takes 45 minutes. That's about
>>>> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>>>> Basically, with HBase I'm seeing read performance of my 16 SSD cluster
>>>> performing nearly 35% slower than a single SSD.
>>>> 
>>>> Here are some things I have changed to no avail:
>>>> Scan caching values
>>>> HDFS block sizes
>>>> HBase block sizes
>>>> Region file sizes
>>>> Memory settings
>>>> GC settings
>>>> Number of mappers/node
>>>> Compressed vs not compressed
>>>> 
>>>> One thing I notice is that the regionserver is using quite a bit of CPU
>>>> during the map reduce job. When dumping the jstack of the process, it seems
>>>> like it is usually in some type of memory allocation or decompression
>>>> routine which didn't seem abnormal.
>>>> 
>>>> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
>>>> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
>>>> I'm on the verge of just writing the dataset out to sequence files once a
>>>> day for scan purposes. Is that what others are doing?
>> 


Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
Have you tried enabling short-circuit read?

Thanks

On Apr 30, 2013, at 9:31 PM, Bryan Keller <br...@gmail.com> wrote:

> Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false)

Re: Poor HBase map-reduce scan performance

Posted by Bryan Keller <br...@gmail.com>.
Yes, I have tried various settings for setCaching(), and I have setCacheBlocks(false).

On Apr 30, 2013, at 9:17 PM, Ted Yu <yu...@gmail.com> wrote:

> From http://hbase.apache.org/book.html#mapreduce.example :
> 
> scan.setCaching(500);        // 1 is the default in Scan, which will
> be bad for MapReduce jobs
> scan.setCacheBlocks(false);  // don't set to true for MR jobs
> 
> I guess you have used the above setting.
> 
> 0.94.x releases are compatible. Have you considered upgrading to, say
> 0.94.7 which was recently released ?
> 
> Cheers


Re: Poor HBase map-reduce scan performance

Posted by Ted Yu <yu...@gmail.com>.
From http://hbase.apache.org/book.html#mapreduce.example :

scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs

I guess you have already used the above settings.
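In case the surrounding context helps, here is a minimal sketch of a driver that applies those two calls to a TableInputFormat scan job. The table name ("mytable") and the class name are placeholders rather than anything from your job, and IdentityTableMapper is just the stock pass-through mapper that ships with HBase.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "full-scan-of-mytable");
    job.setJarByClass(ScanJobDriver.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches instead of one RPC per row
    scan.setCacheBlocks(false);  // don't churn the block cache with a one-off full scan

    // Wires up TableInputFormat, the scan, and the mapper's output types.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, IdentityTableMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    // Map-only job that discards its output.
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}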

0.94.x releases are compatible with one another. Have you considered upgrading to, say,
0.94.7, which was recently released?

Cheers

On Tue, Apr 30, 2013 at 9:01 PM, Bryan Keller <br...@gmail.com> wrote:

> I have been attempting to speed up my HBase map-reduce scans for a while
> now. I have tried just about everything without much luck. I'm running out
> of ideas and was hoping for some suggestions. This is HBase 0.94.2 and
> Hadoop 2.0.0 (CDH4.2.1).
>
> The table I'm scanning:
> 20 mil rows
> Hundreds of columns/row
> Column keys can be 30-40 bytes
> Column values are generally not large, 1k would be on the large side
> 250 regions
> Snappy compression
> 8gb region size
> 512mb memstore flush
> 128k block size
> 700gb of data on HDFS
>
> My cluster has 8 datanodes which are also regionservers. Each has 8 cores
> (16 HT), 64gb RAM, and 2 SSDs. The network is 10gbit. I have a separate
> machine acting as namenode, HMaster, and zookeeper (single instance). I
> have disk local reads turned on.
>
> I'm seeing around 5 gbit/sec on average network IO. Each disk is getting
> 400mb/sec read IO. Theoretically I could get 400mb/sec * 16 = 6.4gb/sec.
>
> Using Hadoop's TestDFSIO tool, I'm seeing around 1.4gb/sec read speed. Not
> really that great compared to the theoretical I/O. However this is far
> better than I am seeing with HBase map-reduce scans of my table.
>
> I have a simple no-op map-only job (using TableInputFormat) that scans the
> table and does nothing with data. This takes 45 minutes. That's about
> 260mb/sec read speed. This is over 5x slower than straight HDFS.
>  Basically, with HBase I'm seeing read performance of my 16 SSD cluster
> performing nearly 35% slower than a single SSD.
>
> Here are some things I have changed to no avail:
> Scan caching values
> HDFS block sizes
> HBase block sizes
> Region file sizes
> Memory settings
> GC settings
> Number of mappers/node
> Compressed vs not compressed
>
> One thing I notice is that the regionserver is using quite a bit of CPU
> during the map reduce job. When dumping the jstack of the process, it seems
> like it is usually in some type of memory allocation or decompression
> routine which didn't seem abnormal.
>
> I can't seem to pinpoint the bottleneck. CPU use by the regionserver is
> high but not maxed out. Disk I/O and network I/O are low, IO wait is low.
> I'm on the verge of just writing the dataset out to sequence files once a
> day for scan purposes. Is that what others are doing?
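As a point of comparison with the no-op map-only job described in the quoted message, the mapper for such a job can be as small as the sketch below (hypothetical class name; it could be swapped in for IdentityTableMapper in a driver like the one sketched earlier in this message). The point of the pattern is that any remaining scan time is attributable to HBase/HDFS rather than to map logic.

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

/** Receives every scanned row via TableInputFormat and deliberately ignores it. */
public class NoOpScanMapper extends TableMapper<ImmutableBytesWritable, Result> {
  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context) {
    // Intentionally empty: the job exists only to measure raw scan throughput.
  }
}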