Posted to user@hbase.apache.org by Jerry Lam <ch...@gmail.com> on 2014/01/02 16:56:59 UTC

Re: Performance between HBaseClient scan and HFileReaderV2

Hi Tom,

Good point. Note that I also ran the HBaseClient performance test several
times (as you can see from the chart), so the OS cache should also have
benefited the later HBaseClient runs, not just the HFileReaderV2 test.

I still don't understand what makes HBaseClient perform so poorly compared
to accessing HDFS directly. I could understand maybe a factor of 2 (even
that is too much), but a factor of 8 is quite unreasonable.

Any hint?

Jerry



On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <to...@gmail.com> wrote:

> I'm also new to HBase and am not familiar with HFileReaderV2.  However, in
> your description, you didn't mention anything about clearing the linux OS
> cache between tests.  That might be why you're seeing the big difference if
> you ran the HBaseClient test first, it may have warmed the OS cache and
> then HFileReaderV2 benefited from it.  Just a guess...
>
> -- Tom
>
>
>
> On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <ch...@gmail.com> wrote:
>
> > Hello HBase users,
> >
> > I just ran a very simple performance test and would like to see if what
> > I experienced makes sense.
> >
> > The experiment is as follows:
> > - I filled an HBase region with 700MB of data (each row has roughly 45
> > columns and the size is 20KB for the entire row)
> > - I configured the region to hold 4GB (therefore no split occurs)
> > - I ran compactions after the data was loaded and made sure that there
> > is only 1 region in the table under test.
> > - No other table exists in the hbase cluster because this is a DEV
> > environment
> > - I'm using HBase 0.92.1
> >
> > The test is very basic. I use HBaseClient to scan the entire region to
> > retrieve all rows and all columns in the table, just iterating all
> > KeyValue pairs until it is done. It took about 1 minute 22 sec to
> > complete. (Note that I disabled the block cache and used a scanner
> > caching size of about 10000.)
> >
> > I ran another test using HFileReaderV2, scanning the entire region to
> > retrieve all rows and all columns, just iterating all KeyValue pairs
> > until it is done. It took 11 sec.
> >
> > The performance difference is dramatic (almost 8 times faster using
> > HFileReaderV2).
> >
> > I want to know whether the difference should be this big or whether I
> > didn't configure HBase properly. This experiment shows that HDFS can
> > deliver the data efficiently, so it is not the bottleneck.
> >
> > Any help is appreciated!
> >
> > Jerry
> >
> >
>
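For reference, the HBaseClient side of the test above probably looked roughly like this 0.92-era sketch. The table name is a placeholder and this needs a live cluster, so it is illustrative only, not the poster's actual code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FullScanTimer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table"); // placeholder name
        Scan scan = new Scan();
        scan.setCaching(10000);     // rows fetched per RPC, as in the test
        scan.setCacheBlocks(false); // block cache disabled, as in the test
        long start = System.currentTimeMillis();
        long cells = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                for (KeyValue kv : r.raw()) { // iterate every KeyValue
                    cells++;
                }
            }
        } finally {
            scanner.close();
        }
        System.out.println(cells + " cells in "
            + (System.currentTimeMillis() - start) + " ms");
    }
}
```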

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
There is https://issues.apache.org/jira/browse/HBASE-9272 opened for
un-ordered scans. I see some use cases for that when you scan over multiple
regions but just want to get the results as fast as possible...
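The idea behind HBASE-9272 is to hand results to the client in completion order rather than key order. A plain-Java sketch of that pattern, with a placeholder per-region fetcher standing in for real scans:

```java
import java.util.*;
import java.util.concurrent.*;

public class UnorderedScanSketch {
    // Placeholder for a per-region scan; returns that region's rows.
    static List<String> scanRegion(String region) {
        return List.of(region + "-row1", region + "-row2");
    }

    // Fetch all regions in parallel and consume results in completion
    // order rather than key order (the HBASE-9272 idea).
    public static Set<String> scanAll(List<String> regions) throws Exception {
        if (regions.isEmpty()) return Set.of();
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        CompletionService<List<String>> cs =
            new ExecutorCompletionService<>(pool);
        for (String r : regions) cs.submit(() -> scanRegion(r));
        Set<String> out = new HashSet<>();
        for (int i = 0; i < regions.size(); i++) out.addAll(cs.take().get());
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(scanAll(List.of("regionA", "regionB")));
    }
}
```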


2014/1/2 Vladimir Rodionov <vr...@carrieriq.com>

> An HBase scanner MUST guarantee the correct order of KeyValues (coming from
> different HFiles), apply filter conditions on included column families and
> qualifiers, honor the time range and max versions, and correctly process
> deleted cells. A direct HFileReader does none of the above.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or Notifications@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>
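Vladimir's first point, the merge-ordered read, is the heart of what a region scanner does on top of a raw HFile read. A toy, HBase-free Java sketch of that merge, with plain sorted string lists standing in for HFiles:

```java
import java.util.*;

public class MergeScanSketch {
    /** Merge several individually sorted lists into one globally sorted
     *  stream, as a region scanner's KeyValueHeap must do across HFiles. */
    public static List<String> mergeScan(List<List<String>> files) {
        // Heap entries: {fileIndex, offsetInFile}, ordered by current key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing(e -> files.get(e[0]).get(e[1])));
        for (int f = 0; f < files.size(); f++)
            if (!files.get(f).isEmpty()) heap.add(new int[]{f, 0});
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();                    // smallest current key
            out.add(files.get(top[0]).get(top[1]));
            if (top[1] + 1 < files.get(top[0]).size())  // advance that "file"
                heap.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = mergeScan(List.of(
            List.of("row1", "row4", "row7"),
            List.of("row2", "row5"),
            List.of("row3", "row6")));
        System.out.println(merged);
        // prints [row1, row2, row3, row4, row5, row6, row7]
    }
}
```

A single-HFile read skips this heap entirely, which is part of why a direct HFileReaderV2 scan can be faster.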

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Jerry Lam <ch...@gmail.com>.
Hello Lars,

Yes, I used setCaching to fetch more KeyValues in each RPC call. And yes,
when I use HFileReaderV2 I am still reading from HDFS. Short-circuit reads
are enabled, but I don't know how to verify they are being used (is there a
log that can tell me?).

I did make sure the HBaseClient runs on the same region server that holds
the data.
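For reference, short-circuit reads in the 0.92/Hadoop 1.x era were typically enabled with HDFS client settings like the following (property names are from that era; verify them against your Hadoop version's documentation):

```xml
<!-- hdfs-site.xml on the RegionServer nodes (illustrative sketch) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- user allowed to bypass the DataNode and read block files directly -->
  <name>dfs.block.local-path-access.user</name>
  <value>hbase</value>
</property>
```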

I just tried asynchbase (as I'm running out of ideas, I have started to try
everything); it took 60 seconds to scan through the data (20 seconds less
than with HBaseClient).

Best Regards,

Jerry

On Thu, Jan 2, 2014 at 4:44 PM, lars hofhansl <la...@apache.org> wrote:

> From the below I gather you set scanner caching (Scan.setCaching(...))?
> When you use HFileReaderV2, you're still reading from HDFS, right? Are you
> using short circuit reading (avoiding network IO)?
>
> In the HBaseClient client you pipe all the data through the network again.
> Is the HBaseClient located on a different machine?
>
> I would use a profiler (just use jVisualVM, which ships with the JDK and
> use the "sampling" profiler) to see where the time is spent.
>
> Lastly, to echo what other folks have said, 0.92 is pretty old at this
> point and I personally added a lot of performance improvements to HBase
> during the 0.94 timeframe, and others have as well.
> If you could test the same with 0.94, I'd be very interested in the
> numbers.
>
> -- Lars
>
>
>
> ________________________________
>  From: Jerry Lam <ch...@gmail.com>
> To: user <us...@hbase.apache.org>
> Sent: Thursday, January 2, 2014 1:32 PM
> Subject: Re: Performance between HBaseClient scan and HFileReaderV2
>
>
> Hello Vladimir,
>
> In my use case, I guarantee that a major compaction is executed before any
> scan happens, because the system we build is a read-only system. There
> will be no deleted cells. Additionally, I only need to read from a single
> column family, and therefore I don't need to access multiple HFiles.
>
> Filter conditions are nice to have because if I can read HFile 8x faster
> than using HBaseClient, I can do the filter on the client side and still
> perform faster than using HBaseClient.
>
> Thank you for your input!
>
> Jerry
>

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by lars hofhansl <la...@apache.org>.
From the below I gather you set scanner caching (Scan.setCaching(...))?
When you use HFileReaderV2, you're still reading from HDFS, right? Are you using short circuit reading (avoiding network IO)?

In the HBaseClient client you pipe all the data through the network again.
Is the HBaseClient located on a different machine?

I would use a profiler (just use jVisualVM, which ships with the JDK and use the "sampling" profiler) to see where the time is spent.

Lastly, to echo what other folks have said, 0.92 is pretty old at this point; I personally added a lot of performance improvements to HBase during the 0.94 timeframe, and others have as well.
If you could test the same with 0.94, I'd be very interested in the numbers.

-- Lars




Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Ted Yu <yu...@gmail.com>.
Jerry:
HBase snapshots are not available in 0.92.x,
so you cannot use HBASE-10076 in 0.92.

FYI


On Thu, Jan 2, 2014 at 3:31 PM, Jerry Lam <ch...@gmail.com> wrote:

> Hello Sergey and Enis,
>
> Thank you for the pointer! HBASE-8691 will definitely help, and HBASE-10076
> (a very interesting/exciting feature, by the way!) is what I need. How can
> I port it to 0.92.x, if that is at all possible?
>
> I understand that my test may not be realistic; however, since I have only
> 1 region with 1 HFile (this is by design), there should not be any "merge"
> sorted read going on.
>
> One thing I'm not sure about: since I use snappy compression, is the value
> of each KeyValue decompressed at the region server? If yes, I think that
> is quite inefficient, because the decompression could be done on the
> client side. Saving bandwidth saves a lot of time for the type of workload
> I'm working on.
>
> Best Regards,
>
> Jerry
>
>
>
> On Thu, Jan 2, 2014 at 5:02 PM, Enis Söztutar <en...@apache.org> wrote:
>
> > Nice test!
> >
> > There are a couple of things here:
> >
> >  (1) HFileReader reads only one file, whereas an HRegion reads multiple
> > files (into the KeyValueHeap) to do a merge scan. So, although there is
> > only one file, there is some overhead in doing a merge-sorted read from
> > multiple files in the region. For a more realistic test, you can try to
> > do the reads using HRegion directly (instead of HFileReader). The
> > overhead is not that much though in my tests.
> >  (2) For scanning with the client API, the results have to be serialized,
> > sent over the network (or loopback for local), and deserialized. This is
> > another overhead that is not there in HFileReader.
> >  (3) The HBase scanner RPC implementation is NOT streaming. The RPC works
> > by fetching batch-size (10000) records at a time, and cannot fully
> > saturate the disk and network pipeline.
> >
> > In my tests for "MapReduce over snapshot files (HBASE-8369)", I have
> > measured 5x difference, because of layers (2) and (3). Please see my
> slides
> > at http://www.slideshare.net/enissoz/mapreduce-over-snapshots
> >
> > I think we can do a much better job at (3), see HBASE-8691. However,
> there
> > will always be "some" overhead, although it should not be 5-8x.
> >
> > As suggested above, in the meantime, you can take a look at the patch for
> > HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
> > whether it suits your use case.
> >
> > Enis
> >
> >
> > On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <sergey@hortonworks.com
> > >wrote:
> >
> > > Er, using MR over snapshots, which reads files directly...
> > > https://issues.apache.org/jira/browse/HBASE-8369
> > > However, it was only committed to 98.
> > > There was interest in 94 port (HBASE-10076), but it never happened...
> > >
> > >
> > > On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <
> sergey@hortonworks.com
> > > >wrote:
> > >
> > > > You might be interested in using
> > > > https://issues.apache.org/jira/browse/HBASE-8369
> > > > However, it was only committed to 98.
> > > > There was interest in 94 port (HBASE-10076), but it never happened...
> > > >
> > > >
> > > --
> > > CONFIDENTIALITY NOTICE
> > > NOTICE: This message is intended for the use of the individual or
> entity
> > to
> > > which it is addressed and may contain information that is confidential,
> > > privileged and exempt from disclosure under applicable law. If the
> reader
> > > of this message is not the intended recipient, you are hereby notified
> > that
> > > any printing, copying, dissemination, distribution, disclosure or
> > > forwarding of this communication is strictly prohibited. If you have
> > > received this communication in error, please contact the sender
> > immediately
> > > and delete it from your system. Thank You.
> > >
> >
>
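Enis's point (3) can be put into a rough cost model: a batched scanner pays per-batch network round trips and per-row serialization on top of the raw transfer time. A hedged sketch, where every input number is a guess rather than a measurement:

```java
public class ScanRpcCostModel {
    // Back-of-envelope model of a batched scan. Each batch of 'caching'
    // rows costs one network round trip; each row also pays a fixed
    // serialization overhead on top of the raw transfer time.
    public static double scanSeconds(long rows, long bytesPerRow, int caching,
                                     double rttSec, double perRowSec,
                                     double bytesPerSec) {
        long roundTrips = (rows + caching - 1) / caching; // ceil(rows/caching)
        return roundTrips * rttSec
             + rows * perRowSec
             + rows * bytesPerRow / bytesPerSec;
    }

    public static void main(String[] args) {
        // ~35,000 rows of 20KB (the 700MB region), caching=10000, 1ms RTT,
        // 20us per-row overhead, ~1 Gb/s effective bandwidth -- all guesses.
        System.out.printf("%.1f s%n",
            scanSeconds(35_000, 20_000, 10_000, 0.001, 20e-6, 125e6));
    }
}
```

With a large caching value the round-trip term is small; it is the per-row serialization and the non-streaming pipeline that dominate, which is consistent with the 5-8x gap discussed above.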

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Jerry Lam <ch...@gmail.com>.
Hello Sergey and Enis,

Thank you for the pointer! HBASE-8691 will definitely help, and HBASE-10076
(a very interesting/exciting feature, by the way!) is what I need. How can I
port it to 0.92.x, if that is at all possible?

I understand that my test may not be realistic; however, since I have only 1
region with 1 HFile (this is by design), there should not be any "merge"
sorted read going on.

One thing I'm not sure about: since I use snappy compression, is the value
of each KeyValue decompressed at the region server? If yes, I think that is
quite inefficient, because the decompression could be done on the client
side. Saving bandwidth saves a lot of time for the type of workload I'm
working on.

Best Regards,

Jerry



> poorly
> > >> in
> > >> > comparison to access directly HDFS. I can understand maybe a factor
> > of 2
> > >> > (even that it is too much) but a factor of 8 is quite unreasonable.
> > >> >
> > >> > Any hint?
> > >> >
> > >> > Jerry
> > >> >
> > >> >
> > >> >
> > >> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <to...@gmail.com>
> > wrote:
> > >> >
> > >> > > I'm also new to HBase and am not familiar with HFileReaderV2.
> > >>  However,
> > >> > in
> > >> > > your description, you didn't mention anything about clearing the
> > >> linux OS
> > >> > > cache between tests.  That might be why you're seeing the big
> > >> difference
> > >> > if
> > >> > > you ran the HBaseClient test first, it may have warmed the OS
> cache
> > >> and
> > >> > > then HFileReaderV2 benefited from it.  Just a guess...
> > >> > >
> > >> > > -- Tom
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <chilinglam@gmail.com
> >
> > >> > wrote:
> > >> > >
> > >> > > > Hello HBase users,
> > >> > > >
> > >> > > > I just ran a very simple performance test and would like to see
> if
> > >> > what I
> > >> > > > experienced make sense.
> > >> > > >
> > >> > > > The experiment is as follows:
> > >> > > > - I filled a hbase region with 700MB data (each row has roughly
> 45
> > >> > > columns
> > >> > > > and the size is 20KB for the entire row)
> > >> > > > - I configured the region to hold 4GB (therefore no split
> occurs)
> > >> > > > - I ran compactions after the data is loaded and make sure that
> > >> there
> > >> > is
> > >> > > > only 1 region in the table under test.
> > >> > > > - No other table exists in the hbase cluster because this is a
> DEV
> > >> > > > environment
> > >> > > > - I'm using HBase 0.92.1
> > >> > > >
> > >> > > > The test is very basic. I use HBaseClient to scan the entire
> > region
> > >> to
> > >> > > > retrieve all rows and all columns in the table, just iterating
> all
> > >> > > KeyValue
> > >> > > > pairs until it is done. It took about 1 minute 22 sec to
> complete.
> > >> > (Note
> > >> > > > that I disable block cache and uses caching size about 10000).
> > >> > > >
> > >> > > > I ran another test using HFileReaderV2 and scan the entire
> region
> > to
> > >> > > > retrieve all rows and all columns, just iterating all keyValue
> > pairs
> > >> > > until
> > >> > > > it is done. It took 11 sec.
> > >> > > >
> > >> > > > The performance difference is dramatic (almost 8 times faster
> > using
> > >> > > > HFileReaderV2).
> > >> > > >
> > >> > > > I want to know why the difference is so big or I didn't
> configure
> > >> HBase
> > >> > > > properly. From this experiment, HDFS can deliver the data
> > >> efficiently
> > >> > so
> > >> > > it
> > >> > > > is not the bottleneck.
> > >> > > >
> > >> > > > Any help is appreciated!
> > >> > > >
> > >> > > > Jerry
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >> > Confidentiality Notice:  The information contained in this message,
> > >> > including any attachments hereto, may be confidential and is
> intended
> > >> to be
> > >> > read only by the individual or entity to whom this message is
> > >> addressed. If
> > >> > the reader of this message is not the intended recipient or an agent
> > or
> > >> > designee of the intended recipient, please note that any review,
> use,
> > >> > disclosure or distribution of this message or its attachments, in
> any
> > >> form,
> > >> > is strictly prohibited.  If you have received this message in error,
> > >> please
> > >> > immediately notify the sender and/or Notifications@carrieriq.com and
> > >> > delete or destroy any copy of this message and its attachments.
> > >> >
> > >>
> > >
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Enis Söztutar <en...@apache.org>.
Nice test!

There are a couple of things here:

 (1) HFileReader reads only one file, whereas an HRegion reads multiple
files (through the KeyValueHeap) to do a merge scan. So even when there is
only one file, there is some overhead from doing a merge-sorted read across
the files in the region. For a more realistic test, you can try to do the
reads using HRegion directly (instead of HFileReader). In my tests, though,
that overhead is not that large.
 (2) For scanning with the client API, the results have to be serialized,
sent over the network (or loopback locally), and deserialized. This is
overhead that does not exist in HFileReader.
 (3) The HBase scanner RPC implementation is NOT streaming. The client
fetches one batch of records per round trip (the caching size, 10000 here),
so it cannot fully saturate the disk and network pipeline.
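
The merge-sorted read in (1) is what HBase's KeyValueHeap does: one sorted
scanner per HFile, combined through a priority queue. A minimal
self-contained sketch of the idea (sorted string lists stand in for
per-file KeyValue scanners; the names are illustrative, not the actual
HBase classes):

```java
import java.util.*;

// Toy version of the KeyValueHeap: HRegion merge-sorts the sorted streams
// coming from each store file, which a lone HFileReader never has to do.
public class MergeScanSketch {
    static List<String> mergeScan(List<List<String>> files) {
        // Heap entries are {fileIndex, offsetInFile}, ordered by the key
        // each entry currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> files.get(e[0]).get(e[1])));
        for (int f = 0; f < files.size(); f++) {
            if (!files.get(f).isEmpty()) heap.add(new int[]{f, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();               // smallest current key wins
            List<String> file = files.get(top[0]);
            out.add(file.get(top[1]));
            if (top[1] + 1 < file.size()) {        // advance that file's scanner
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Keys come back in global sorted order across all "files".
        System.out.println(mergeScan(List.of(
                List.of("row1", "row4", "row7"),
                List.of("row2", "row3", "row8"),
                List.of("row5", "row6"))));
    }
}
```

Even with a single store file, HRegion still drives the scan through this
machinery, though as noted above that cost is small.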

In my tests for "MapReduce over snapshot files" (HBASE-8369), I measured a
5x difference because of (2) and (3). Please see my slides
at http://www.slideshare.net/enissoz/mapreduce-over-snapshots

I think we can do a much better job at (3), see HBASE-8691. However, there
will always be "some" overhead, although it should not be 5-8x.
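
To put rough numbers on (3): with the figures from this thread (a ~700MB
region, ~20KB rows, scanner caching of 10000), the whole scan needs only a
handful of synchronous round trips. A back-of-envelope sketch (all inputs
are the thread's assumptions, not measurements):

```java
// Back-of-envelope for the non-streaming scanner RPC (point 3): the client
// fetches `caching` rows per synchronous round trip. Region size, row size
// and caching value are the rough numbers from this thread, not measurements.
public class ScanRpcSketch {
    static long rows(long regionBytes, long rowBytes) {
        return regionBytes / rowBytes;
    }

    static long roundTrips(long totalRows, long caching) {
        return (totalRows + caching - 1) / caching;  // ceiling division
    }

    public static void main(String[] args) {
        long regionBytes = 700L << 20;               // ~700MB region
        long rowBytes = 20L << 10;                   // ~20KB per row
        long r = rows(regionBytes, rowBytes);        // 35840 rows
        System.out.println(r + " rows, "
                + roundTrips(r, 10_000) + " scanner RPCs");
    }
}
```

A handful of round trips is not much by itself, so the per-cell
serialization and deserialization in (2) is likely the larger share of the
gap here; the streaming work mentioned above targets (3).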

As suggested above, in the meantime, you can take a look at the patch for
HBASE-8369, and https://issues.apache.org/jira/browse/HBASE-10076 to see
whether it suits your use case.

Enis


On Thu, Jan 2, 2014 at 1:43 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> Er, using MR over snapshots, which reads files directly...
> https://issues.apache.org/jira/browse/HBASE-8369
> However, it was only committed to 98.
> There was interest in 94 port (HBASE-10076), but it never happened...
>
>
> On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <sergey@hortonworks.com
> >wrote:
>
> > You might be interested in using
> > https://issues.apache.org/jira/browse/HBASE-8369
> > However, it was only committed to 98.
> > There was interest in 94 port (HBASE-10076), but it never happened...
> >
> >
> > On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <ch...@gmail.com> wrote:
> >
> >> Hello Vladimir,
> >>
> >> In my use case, I guarantee that a major compaction is executed before
> any
> >> scan happens because the system we build is a read only system. There
> will
> >> have no deleted cells. Additionally, I only need to read from a single
> >> column family and therefore I don't need to access multiple HFiles.
> >>
> >> Filter conditions are nice to have because if I can read HFile 8x faster
> >> than using HBaseClient, I can do the filter on the client side and still
> >> perform faster than using HBaseClient.
> >>
> >> Thank you for your input!
> >>
> >> Jerry
> >>
> >>
> >>
> >> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> >> <vr...@carrieriq.com>wrote:
> >>
> >> > HBase scanner MUST guarantee correct order of KeyValues (coming from
> >> > different HFile's),
> >> > filter condition+ filter condition on included column families and
> >> > qualifiers, time range, max versions and correctly process deleted
> >> cells.
> >> > Direct HFileReader does nothing from the above list.
> >> >
> >> > Best regards,
> >> > Vladimir Rodionov
> >> > Principal Platform Engineer
> >> > Carrier IQ, www.carrieriq.com
> >> > e-mail: vrodionov@carrieriq.com
> >> >
> >> > ________________________________________
> >> > From: Jerry Lam [chilinglam@gmail.com]
> >> > Sent: Thursday, January 02, 2014 7:56 AM
> >> > To: user
> >> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
> >> >
> >> > Hi Tom,
> >> >
> >> > Good point. Note that I also ran the HBaseClient performance test
> >> several
> >> > times (as you can see from the chart). The caching should also benefit
> >> the
> >> > second time I ran the HBaseClient performance test not just
> benefitting
> >> the
> >> > HFileReaderV2 test.
> >> >
> >> > I still don't understand what makes the HBaseClient performs so poorly
> >> in
> >> > comparison to access directly HDFS. I can understand maybe a factor
> of 2
> >> > (even that it is too much) but a factor of 8 is quite unreasonable.
> >> >
> >> > Any hint?
> >> >
> >> > Jerry
> >> >
> >> >
> >> >
> >> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <to...@gmail.com>
> wrote:
> >> >
> >> > > I'm also new to HBase and am not familiar with HFileReaderV2.
> >>  However,
> >> > in
> >> > > your description, you didn't mention anything about clearing the
> >> linux OS
> >> > > cache between tests.  That might be why you're seeing the big
> >> difference
> >> > if
> >> > > you ran the HBaseClient test first, it may have warmed the OS cache
> >> and
> >> > > then HFileReaderV2 benefited from it.  Just a guess...
> >> > >
> >> > > -- Tom
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <ch...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Hello HBase users,
> >> > > >
> >> > > > I just ran a very simple performance test and would like to see if
> >> > what I
> >> > > > experienced make sense.
> >> > > >
> >> > > > The experiment is as follows:
> >> > > > - I filled a hbase region with 700MB data (each row has roughly 45
> >> > > columns
> >> > > > and the size is 20KB for the entire row)
> >> > > > - I configured the region to hold 4GB (therefore no split occurs)
> >> > > > - I ran compactions after the data is loaded and make sure that
> >> there
> >> > is
> >> > > > only 1 region in the table under test.
> >> > > > - No other table exists in the hbase cluster because this is a DEV
> >> > > > environment
> >> > > > - I'm using HBase 0.92.1
> >> > > >
> >> > > > The test is very basic. I use HBaseClient to scan the entire
> region
> >> to
> >> > > > retrieve all rows and all columns in the table, just iterating all
> >> > > KeyValue
> >> > > > pairs until it is done. It took about 1 minute 22 sec to complete.
> >> > (Note
> >> > > > that I disable block cache and uses caching size about 10000).
> >> > > >
> >> > > > I ran another test using HFileReaderV2 and scan the entire region
> to
> >> > > > retrieve all rows and all columns, just iterating all keyValue
> pairs
> >> > > until
> >> > > > it is done. It took 11 sec.
> >> > > >
> >> > > > The performance difference is dramatic (almost 8 times faster
> using
> >> > > > HFileReaderV2).
> >> > > >
> >> > > > I want to know why the difference is so big or I didn't configure
> >> HBase
> >> > > > properly. From this experiment, HDFS can deliver the data
> >> efficiently
> >> > so
> >> > > it
> >> > > > is not the bottleneck.
> >> > > >
> >> > > > Any help is appreciated!
> >> > > >
> >> > > > Jerry
> >> > > >
> >> > > >
> >> > >
> >> >
> >> >
> >>
> >
> >
>
>

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by lars hofhansl <la...@apache.org>.
It never happened because it can be done without the port.
I can attach the code that we will use (untested at this point) to the issue.

-- Lars



________________________________
 From: Sergey Shelukhin <se...@hortonworks.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Thursday, January 2, 2014 1:43 PM
Subject: Re: Performance between HBaseClient scan and HFileReaderV2
 

Er, using MR over snapshots, which reads files directly...
https://issues.apache.org/jira/browse/HBASE-8369
However, it was only committed to 98.
There was interest in 94 port (HBASE-10076), but it never happened...


On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> You might be interested in using
> https://issues.apache.org/jira/browse/HBASE-8369
> However, it was only committed to 98.
> There was interest in 94 port (HBASE-10076), but it never happened...
>
>
> On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hello Vladimir,
>>
>> In my use case, I guarantee that a major compaction is executed before any
>> scan happens because the system we build is a read only system. There will
>> have no deleted cells. Additionally, I only need to read from a single
>> column family and therefore I don't need to access multiple HFiles.
>>
>> Filter conditions are nice to have because if I can read HFile 8x faster
>> than using HBaseClient, I can do the filter on the client side and still
>> perform faster than using HBaseClient.
>>
>> Thank you for your input!
>>
>> Jerry
>>
>>
>>
>> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
>> <vr...@carrieriq.com>wrote:
>>
>> > HBase scanner MUST guarantee correct order of KeyValues (coming from
>> > different HFile's),
>> > filter condition+ filter condition on included column families and
>> > qualifiers, time range, max versions and correctly process deleted
>> cells.
>> > Direct HFileReader does nothing from the above list.
>> >
>> > Best regards,
>> > Vladimir Rodionov
>> > Principal Platform Engineer
>> > Carrier IQ, www.carrieriq.com
>> > e-mail: vrodionov@carrieriq.com
>> >
>> > ________________________________________
>> > From: Jerry Lam [chilinglam@gmail.com]
>> > Sent: Thursday, January 02, 2014 7:56 AM
>> > To: user
>> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
>> >
>> > Hi Tom,
>> >
>> > Good point. Note that I also ran the HBaseClient performance test
>> several
>> > times (as you can see from the chart). The caching should also benefit
>> the
>> > second time I ran the HBaseClient performance test not just benefitting
>> the
>> > HFileReaderV2 test.
>> >
>> > I still don't understand what makes the HBaseClient performs so poorly
>> in
>> > comparison to access directly HDFS. I can understand maybe a factor of 2
>> > (even that it is too much) but a factor of 8 is quite unreasonable.
>> >
>> > Any hint?
>> >
>> > Jerry
>> >
>> >
>> >
>> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <to...@gmail.com> wrote:
>> >
>> > > I'm also new to HBase and am not familiar with HFileReaderV2.
>>  However,
>> > in
>> > > your description, you didn't mention anything about clearing the
>> linux OS
>> > > cache between tests.  That might be why you're seeing the big
>> difference
>> > if
>> > > you ran the HBaseClient test first, it may have warmed the OS cache
>> and
>> > > then HFileReaderV2 benefited from it.  Just a guess...
>> > >
>> > > -- Tom
>> > >
>> > >
>> > >
>> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <ch...@gmail.com>
>> > wrote:
>> > >
>> > > > Hello HBase users,
>> > > >
>> > > > I just ran a very simple performance test and would like to see if
>> > what I
>> > > > experienced make sense.
>> > > >
>> > > > The experiment is as follows:
>> > > > - I filled a hbase region with 700MB data (each row has roughly 45
>> > > columns
>> > > > and the size is 20KB for the entire row)
>> > > > - I configured the region to hold 4GB (therefore no split occurs)
>> > > > - I ran compactions after the data is loaded and make sure that
>> there
>> > is
>> > > > only 1 region in the table under test.
>> > > > - No other table exists in the hbase cluster because this is a DEV
>> > > > environment
>> > > > - I'm using HBase 0.92.1
>> > > >
>> > > > The test is very basic. I use HBaseClient to scan the entire region
>> to
>> > > > retrieve all rows and all columns in the table, just iterating all
>> > > KeyValue
>> > > > pairs until it is done. It took about 1 minute 22 sec to complete.
>> > (Note
>> > > > that I disable block cache and uses caching size about 10000).
>> > > >
>> > > > I ran another test using HFileReaderV2 and scan the entire region to
>> > > > retrieve all rows and all columns, just iterating all keyValue pairs
>> > > until
>> > > > it is done. It took 11 sec.
>> > > >
>> > > > The performance difference is dramatic (almost 8 times faster using
>> > > > HFileReaderV2).
>> > > >
>> > > > I want to know why the difference is so big or I didn't configure
>> HBase
>> > > > properly. From this experiment, HDFS can deliver the data
>> efficiently
>> > so
>> > > it
>> > > > is not the bottleneck.
>> > > >
>> > > > Any help is appreciated!
>> > > >
>> > > > Jerry
>> > > >
>> > > >
>> > >
>> >
>> >
>>
>
>


Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Sergey Shelukhin <se...@hortonworks.com>.
Er, using MR over snapshots, which reads files directly...
https://issues.apache.org/jira/browse/HBASE-8369
However, it was only committed to 0.98.
There was interest in a 0.94 port (HBASE-10076), but it never happened...


On Thu, Jan 2, 2014 at 1:42 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> You might be interested in using
> https://issues.apache.org/jira/browse/HBASE-8369
> However, it was only committed to 98.
> There was interest in 94 port (HBASE-10076), but it never happened...
>
>
> On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hello Vladimir,
>>
>> In my use case, I guarantee that a major compaction is executed before any
>> scan happens because the system we build is a read only system. There will
>> have no deleted cells. Additionally, I only need to read from a single
>> column family and therefore I don't need to access multiple HFiles.
>>
>> Filter conditions are nice to have because if I can read HFile 8x faster
>> than using HBaseClient, I can do the filter on the client side and still
>> perform faster than using HBaseClient.
>>
>> Thank you for your input!
>>
>> Jerry
>>
>>
>>
>> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
>> <vr...@carrieriq.com>wrote:
>>
>> > HBase scanner MUST guarantee correct order of KeyValues (coming from
>> > different HFile's),
>> > filter condition+ filter condition on included column families and
>> > qualifiers, time range, max versions and correctly process deleted
>> cells.
>> > Direct HFileReader does nothing from the above list.
>> >
>> > Best regards,
>> > Vladimir Rodionov
>> > Principal Platform Engineer
>> > Carrier IQ, www.carrieriq.com
>> > e-mail: vrodionov@carrieriq.com
>> >
>> > ________________________________________
>> > From: Jerry Lam [chilinglam@gmail.com]
>> > Sent: Thursday, January 02, 2014 7:56 AM
>> > To: user
>> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
>> >
>> > Hi Tom,
>> >
>> > Good point. Note that I also ran the HBaseClient performance test
>> several
>> > times (as you can see from the chart). The caching should also benefit
>> the
>> > second time I ran the HBaseClient performance test not just benefitting
>> the
>> > HFileReaderV2 test.
>> >
>> > I still don't understand what makes the HBaseClient performs so poorly
>> in
>> > comparison to access directly HDFS. I can understand maybe a factor of 2
>> > (even that it is too much) but a factor of 8 is quite unreasonable.
>> >
>> > Any hint?
>> >
>> > Jerry
>> >
>> >
>> >
>> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <to...@gmail.com> wrote:
>> >
>> > > I'm also new to HBase and am not familiar with HFileReaderV2.
>>  However,
>> > in
>> > > your description, you didn't mention anything about clearing the
>> linux OS
>> > > cache between tests.  That might be why you're seeing the big
>> difference
>> > if
>> > > you ran the HBaseClient test first, it may have warmed the OS cache
>> and
>> > > then HFileReaderV2 benefited from it.  Just a guess...
>> > >
>> > > -- Tom
>> > >
>> > >
>> > >
>> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <ch...@gmail.com>
>> > wrote:
>> > >
>> > > > Hello HBase users,
>> > > >
>> > > > I just ran a very simple performance test and would like to see if
>> > what I
>> > > > experienced make sense.
>> > > >
>> > > > The experiment is as follows:
>> > > > - I filled a hbase region with 700MB data (each row has roughly 45
>> > > columns
>> > > > and the size is 20KB for the entire row)
>> > > > - I configured the region to hold 4GB (therefore no split occurs)
>> > > > - I ran compactions after the data is loaded and make sure that
>> there
>> > is
>> > > > only 1 region in the table under test.
>> > > > - No other table exists in the hbase cluster because this is a DEV
>> > > > environment
>> > > > - I'm using HBase 0.92.1
>> > > >
>> > > > The test is very basic. I use HBaseClient to scan the entire region
>> to
>> > > > retrieve all rows and all columns in the table, just iterating all
>> > > KeyValue
>> > > > pairs until it is done. It took about 1 minute 22 sec to complete.
>> > (Note
>> > > > that I disable block cache and uses caching size about 10000).
>> > > >
>> > > > I ran another test using HFileReaderV2 and scan the entire region to
>> > > > retrieve all rows and all columns, just iterating all keyValue pairs
>> > > until
>> > > > it is done. It took 11 sec.
>> > > >
>> > > > The performance difference is dramatic (almost 8 times faster using
>> > > > HFileReaderV2).
>> > > >
>> > > > I want to know why the difference is so big or I didn't configure
>> HBase
>> > > > properly. From this experiment, HDFS can deliver the data
>> efficiently
>> > so
>> > > it
>> > > > is not the bottleneck.
>> > > >
>> > > > Any help is appreciated!
>> > > >
>> > > > Jerry
>> > > >
>> > > >
>> > >
>> >
>> >
>>
>
>


Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Sergey Shelukhin <se...@hortonworks.com>.
You might be interested in using
https://issues.apache.org/jira/browse/HBASE-8369
However, it was only committed to 0.98.
There was interest in a 0.94 port (HBASE-10076), but it never happened...


On Thu, Jan 2, 2014 at 1:32 PM, Jerry Lam <ch...@gmail.com> wrote:

> Hello Vladimir,
>
> In my use case, I guarantee that a major compaction is executed before any
> scan happens because the system we build is a read only system. There will
> have no deleted cells. Additionally, I only need to read from a single
> column family and therefore I don't need to access multiple HFiles.
>
> Filter conditions are nice to have because if I can read HFile 8x faster
> than using HBaseClient, I can do the filter on the client side and still
> perform faster than using HBaseClient.
>
> Thank you for your input!
>
> Jerry
>
>
>
> On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
> <vr...@carrieriq.com>wrote:
>
> > HBase scanner MUST guarantee correct order of KeyValues (coming from
> > different HFile's),
> > filter condition+ filter condition on included column families and
> > qualifiers, time range, max versions and correctly process deleted cells.
> > Direct HFileReader does nothing from the above list.
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: vrodionov@carrieriq.com
> >
> > ________________________________________
> > From: Jerry Lam [chilinglam@gmail.com]
> > Sent: Thursday, January 02, 2014 7:56 AM
> > To: user
> > Subject: Re: Performance between HBaseClient scan and HFileReaderV2
> >
> > Hi Tom,
> >
> > Good point. Note that I also ran the HBaseClient performance test several
> > times (as you can see from the chart). The caching should also benefit
> the
> > second time I ran the HBaseClient performance test not just benefitting
> the
> > HFileReaderV2 test.
> >
> > I still don't understand what makes the HBaseClient performs so poorly in
> > comparison to access directly HDFS. I can understand maybe a factor of 2
> > (even that it is too much) but a factor of 8 is quite unreasonable.
> >
> > Any hint?
> >
> > Jerry
> >
> >
> >
> > On Sun, Dec 29, 2013 at 9:09 PM, Tom Hood <to...@gmail.com> wrote:
> >
> > > I'm also new to HBase and am not familiar with HFileReaderV2.  However,
> > in
> > > your description, you didn't mention anything about clearing the linux
> OS
> > > cache between tests.  That might be why you're seeing the big
> difference
> > if
> > > you ran the HBaseClient test first, it may have warmed the OS cache and
> > > then HFileReaderV2 benefited from it.  Just a guess...
> > >
> > > -- Tom
> > >
> > >
> > >
> > > On Mon, Dec 23, 2013 at 12:18 PM, Jerry Lam <ch...@gmail.com>
> > wrote:
> > >
> > > > Hello HBase users,
> > > >
> > > > I just ran a very simple performance test and would like to see if
> > what I
> > > > experienced make sense.
> > > >
> > > > The experiment is as follows:
> > > > - I filled a hbase region with 700MB data (each row has roughly 45
> > > columns
> > > > and the size is 20KB for the entire row)
> > > > - I configured the region to hold 4GB (therefore no split occurs)
> > > > - I ran compactions after the data is loaded and make sure that there
> > is
> > > > only 1 region in the table under test.
> > > > - No other table exists in the hbase cluster because this is a DEV
> > > > environment
> > > > - I'm using HBase 0.92.1
> > > >
> > > > The test is very basic. I use HBaseClient to scan the entire region
> to
> > > > retrieve all rows and all columns in the table, just iterating all
> > > KeyValue
> > > > pairs until it is done. It took about 1 minute 22 sec to complete.
> > (Note
> > > > that I disable block cache and uses caching size about 10000).
> > > >
> > > > I ran another test using HFileReaderV2 and scan the entire region to
> > > > retrieve all rows and all columns, just iterating all keyValue pairs
> > > until
> > > > it is done. It took 11 sec.
> > > >
> > > > The performance difference is dramatic (almost 8 times faster using
> > > > HFileReaderV2).
> > > >
> > > > I want to know why the difference is so big or I didn't configure
> HBase
> > > > properly. From this experiment, HDFS can deliver the data efficiently
> > so
> > > it
> > > > is not the bottleneck.
> > > >
> > > > Any help is appreciated!
> > > >
> > > > Jerry
> > > >
> > > >
> > >
> >
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Performance between HBaseClient scan and HFileReaderV2

Posted by Jerry Lam <ch...@gmail.com>.
Hello Vladimir,

In my use case, I guarantee that a major compaction is executed before any
scan happens, because the system we build is a read-only system. There will
be no deleted cells. Additionally, I only need to read from a single column
family, so I don't need to access multiple HFiles.

Filter conditions are nice to have, but not essential: if I can read the
HFile 8x faster than through HBaseClient, I can do the filtering on the
client side and still perform faster than HBaseClient.

Thank you for your input!

Jerry
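
A quick back-of-envelope check of that trade-off, using the 82 sec vs 11 sec timings reported in this thread (the KeyValue count is estimated from ~35,840 rows x ~45 columns; the result is the per-KeyValue time budget that client-side filtering must stay under to break even):

```java
// Back-of-envelope sketch: if a direct HFile read is ~8x faster, how much
// time per KeyValue can client-side filtering spend before the advantage is
// gone? Timings are the measurements from this thread; the KeyValue count is
// an estimate from the original post's figures.
public class FilterBudget {
    public static void main(String[] args) {
        double scannerSecs = 82.0;          // full scan via HBaseClient
        double directSecs  = 11.0;          // full scan via HFileReaderV2
        long   totalKVs    = 35_840L * 45;  // ~35,840 rows x ~45 columns
        // Per-KeyValue budget before the direct approach loses its edge:
        double budgetMicros = (scannerSecs - directSecs) / totalKVs * 1e6;
        System.out.printf("~%.0f us per KeyValue%n", budgetMicros); // roughly 44 us
    }
}
```

Tens of microseconds per KeyValue is a generous budget for a simple client-side predicate, which supports the argument above.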



On Thu, Jan 2, 2014 at 1:30 PM, Vladimir Rodionov
<vr...@carrieriq.com>wrote:

> An HBase scanner MUST guarantee the correct order of KeyValues (coming from
> different HFiles), apply filter conditions, restrict results to the included
> column families and qualifiers, honor the time range and max versions, and
> correctly process deleted cells. A direct HFileReader does none of the above.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com

RE: Performance between HBaseClient scan and HFileReaderV2

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
An HBase scanner MUST guarantee the correct order of KeyValues (coming from different HFiles), apply filter conditions, restrict results to the included column families and qualifiers, honor the time range and max versions, and correctly process deleted cells.
A direct HFileReader does none of the above.
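
The multi-file ordering guarantee is essentially a k-way merge. A self-contained sketch in plain Java, with strings standing in for KeyValues (an illustration of the idea, not HBase's actual scanner code):

```java
import java.util.*;

// Minimal sketch of work a region scanner does that a direct HFile read does
// not: merging several sorted KeyValue streams (one per store file) into one
// globally sorted stream. String keys stand in for real KeyValues.
public class MergeSketch {
    // k-way merge via a priority queue of (file index, position) cursors.
    static List<String> mergedScan(List<List<String>> hfiles) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> hfiles.get(e[0]).get(e[1])));
        for (int f = 0; f < hfiles.size(); f++)
            if (!hfiles.get(f).isEmpty()) heap.add(new int[]{f, 0});
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(hfiles.get(e[0]).get(e[1]));
            if (e[1] + 1 < hfiles.get(e[0]).size())
                heap.add(new int[]{e[0], e[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> hfiles = List.of(
            List.of("a", "d", "g"),   // older store file
            List.of("b", "e"),        // newer store file
            List.of("c", "f"));       // recent flush
        System.out.println(mergedScan(hfiles)); // [a, b, c, d, e, f, g]
    }
}
```

After a major compaction there is a single sorted file per column family, so a direct reader can skip the heap entirely; the scanner path always pays for this machinery plus filtering and version checks.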

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Jerry Lam [chilinglam@gmail.com]
Sent: Thursday, January 02, 2014 7:56 AM
To: user
Subject: Re: Performance between HBaseClient scan and HFileReaderV2

Hi Tom,

Good point. Note that I also ran the HBaseClient performance test several
times (as you can see from the chart). The caching should also benefit the
second time I ran the HBaseClient performance test, not just the
HFileReaderV2 test.

I still don't understand what makes HBaseClient perform so poorly in
comparison to accessing HDFS directly. I can understand maybe a factor of 2
(even that is too much), but a factor of 8 is quite unreasonable.

Any hint?

Jerry


