Posted to user@hbase.apache.org by James Taylor <jt...@salesforce.com> on 2013/04/07 08:05:15 UTC

Essential column family performance

Hello,
We're doing some performance testing of the essential column family 
feature, and we're seeing some performance degradation when comparing 
with and without the feature enabled:

                           Performance of scan relative
% of rows selected        to not enabling the feature
---------------------    --------------------------------
100%                            1.0x
  80%                            2.0x
  60%                            2.3x
  40%                            2.2x
  20%                            1.5x
  10%                            1.0x
   5%                            0.67x
   0%                            0.30x

In our scenario, we have two column families. The key value from the 
essential column family is used in the filter, while the key value from 
the other, non essential column family is returned by the scan. Each row 
contains values for both key values, with the values being relatively 
narrow (less than 50 bytes). In this scenario, the only time we're 
seeing a performance gain is when less than 10% of the rows are selected.
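
To make the table easier to read, here is a minimal sketch in Java that converts each ratio into a percent change in scan time. It assumes the ratios are elapsed-time ratios (feature enabled divided by feature disabled), which is consistent with a gain only showing up below 10% selectivity, and it reads the 0% row as a ratio of 0.30. The class and method names are made up for illustration.

```java
// Sketch only: treats the table's ratios as elapsed-time ratios
// (feature enabled / feature disabled). That reading is an assumption.
class RelativeScanTime {
    // Percent change in scan time implied by a time ratio; positive means slower.
    static double percentChange(double ratio) {
        return (ratio - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        int[] selectedPercent = {100, 80, 60, 40, 20, 10, 5, 0};
        // The 0% row is read as a ratio of 0.30 (assumption).
        double[] ratio = {1.0, 2.0, 2.3, 2.2, 1.5, 1.0, 0.67, 0.30};
        for (int i = 0; i < ratio.length; i++) {
            System.out.printf("%3d%% selected: %+.0f%% scan time%n",
                    selectedPercent[i], percentChange(ratio[i]));
        }
    }
}
```

Read this way, the scan takes about twice as long at 80% selectivity and about 70% less time when no rows match.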

Is this a reasonable test? Has anyone else measured this?

Thanks,

James






Re: Essential column family performance

Posted by Michael Segel <mi...@hotmail.com>.
I think that JM brings up a good point. 
Keep in mind that RLL in HBase is not the same as Row Level Locking in transactional systems. 
Depending on the use case, you can keep things in separate tables and not worry about the issues with CFs.

So when you think about your design, separate tables may be a valid choice. 

IMHO, more thought is needed before using CFs.

The essential column family feature sounds like it's more beneficial for edge cases and not so much for the primary use case. 
Again, IMHO, if you're using it for your primary use case, then I think you should rethink your schema design. 

To Ted's point, keeping like data within CFs makes it easier to process the data within an M/R framework, since your scanner will work against the CFs in the table. 

Yet I have to ask why you would filter on one CF when pulling data from a second. Why not duplicate the data and store it in both? Again, this is highly dependent on the use case.

Just saying...


On Apr 8, 2013, at 12:23 PM, Ted Yu <yu...@gmail.com> wrote:

> Currently atomicity support in HBase is for single table, single region.
> 
> If user chooses separate tables, it might be harder to implement the
> business logic.
> 
> On Mon, Apr 8, 2013 at 10:19 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
> 
>> Something I'm not getting: why not use separate tables instead of
>> CFs for a single table? Simply name your table tablename_cfname, then
>> you get rid of the CF# limitation?
>> 
>> Or are there big pros to having CFs?
>> 
>> JM
>> 
>> 2013/4/8 Anoop John <an...@gmail.com>:
>>> Agree here. The effectiveness depends on what % of the data satisfies the
>>> condition and how it is distributed across HFile blocks. We will get a
>>> performance gain when we are able to skip some HFile blocks (from
>>> non-essential CFs). Can you test with a different HFile block size (a lower value)?
>>> 
>>> -Anoop-
>>> 
>>> 
>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>>> I made the following change in TestJoinedScanners.java:
>>>> 
>>>> -      int flag_percent = 1;
>>>> +      int flag_percent = 40;
>>>> 
>>>> The test took longer but still favors joined scanner.
>>>> I got some new results:
>>>> 
>>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>> 
>>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>> 
>>>> Looks like effectiveness of joined scanner is affected by distribution of
>>>> data.
>>>> 
>>>> Cheers
>>>> 
>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <la...@apache.org> wrote:
>>>> 
>>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>>> rows match, which would somewhat be in line with James' results.
>>>>> 
>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>> 
>>>>> 
>>>>> -- Lars
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Ted Yu <yu...@gmail.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>> Subject: Re: Essential column family performance
>>>>> 
>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>>> reference.
>>>>> 
>>>>> On my MacBook, I got the following results from the test:
>>>>> 
>>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>> ...
>>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>> 
>>>>>> Looking at
>>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
>>>>>> I found that it didn't contain TestJoinedScanners, which shows the
>>>>>> difference in scanner performance:
>>>>>> 
>>>>>>   LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>       Double.toString(timeSec)
>>>>>>       + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>> 
>>>>>> The test uses SingleColumnValueFilter:
>>>>>> 
>>>>>>    SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>        cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>> 
>>>>>> It is possible that the custom filter you were using would exhibit a
>>>>>> different access pattern compared to SingleColumnValueFilter, e.g. does
>>>>>> your filter utilize hints?
>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>> experienced if you put your scenario in some test similar to
>>>>>> TestJoinedScanners.
>>>>>> 
>>>>>> Will take a closer look at the code Monday.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>> 
>>>>>>> I can see that if the essential column family has more data compared to
>>>>>>> the non essential column family that the results would eventually even out.
>>>>>>> I was hoping to always be able to enable the essential column family
>>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> James
>>>>>>> 
>>>>>>> 
>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>> 
>>>>>>>> James:
>>>>>>>> Your test was based on 0.94.6.1, right?
>>>>>>>> 
>>>>>>>> What Filter were you using?
>>>>>>>> 
>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here?
>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>> 
>>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>>> family carrying considerably more data compared to essential column family.
>>>>>>>> 
>>>>>>>> Cheers
>> 
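
As a rough illustration of the TestJoinedScanners timings Ted Yu posted in the quoted thread above, the sketch below computes the speedup of the joined scanner over the slow scanner for each pair of log lines. The timings are copied from the quoted log output; the class and method names are made up for illustration, and single runs on a laptop are of course noisy.

```java
// Sketch only: speedup implied by the quoted TestJoinedScanners log lines.
class JoinedScannerSpeedup {
    // Speedup of the joined scanner: slow-scanner time divided by joined-scanner time.
    static double speedup(double slowSec, double joinedSec) {
        return slowSec / joinedSec;
    }

    public static void main(String[] args) {
        // {slow scanner seconds, joined scanner seconds}
        double[][] runs = {
            {7.973822, 0.47235},   // flag_percent = 1
            {7.424388, 5.05063},   // flag_percent = 40, first pair
            {6.348517, 4.587545},  // flag_percent = 40, second pair
        };
        for (double[] r : runs) {
            System.out.printf("slow=%.3fs joined=%.3fs speedup=%.2fx%n",
                    r[0], r[1], speedup(r[0], r[1]));
        }
    }
}
```

On these numbers the joined scanner is roughly 17 times faster when 1% of rows match, but only about 1.4 to 1.5 times faster when 40% match.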


Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
Currently atomicity support in HBase is for a single table, single region.

If a user chooses separate tables, it might be harder to implement the
business logic.

On Mon, Apr 8, 2013 at 10:19 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Something I'm not getting: why not use separate tables instead of
> CFs for a single table? Simply name your table tablename_cfname, then
> you get rid of the CF# limitation?
>
> Or are there big pros to having CFs?
>
> JM

Re: Essential column family performance

Posted by lars hofhansl <la...@apache.org>.
In this case it is all handled at the server, and if you are doing scans you still get the benefits of the sequential access pattern (rather than doing a lot of seeks for point Gets).

-- Lars



________________________________
 From: Jean-Marc Spaggiari <je...@spaggiari.org>
To: user@hbase.apache.org 
Sent: Monday, April 8, 2013 10:19 AM
Subject: Re: Essential column family performance
 
Something I'm not getting: why not use separate tables instead of
CFs for a single table? Simply name your table tablename_cfname, then
you get rid of the CF# limitation?

Or are there big pros to having CFs?

JM


Re: Essential column family performance

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Something I'm not getting: why not use separate tables instead of
CFs for a single table? Simply name your table tablename_cfname, then
you get rid of the CF# limitation?

Or are there big pros to having CFs?

JM

2013/4/8 Anoop John <an...@gmail.com>:
> Agree here. The effectiveness depends on what % of the data satisfies the
> condition and how it is distributed across HFile blocks. We will get a
> performance gain when we are able to skip some HFile blocks (from
> non-essential CFs). Can you test with a different HFile block size (a lower value)?
>
> -Anoop-

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
bq. additional cost of seeks/merging the results from two CFs outweighs
the benefit of lazy loading on such small values

This was my thinking as well.

HRegion#nextInternal() is local to the underlying region, which makes it
difficult for this method to adjust scanning behavior in-flight.
Currently the only hint a client can provide when calling
RegionScanner#nextRaw() is limit:

  public boolean nextRaw(List<KeyValue> result, int limit, String metric)
      throws IOException;

I think we need some mechanism where RegionScanner can tell the caller what
percentage of data gets filtered and, when the essential column family
feature is used, what the relative sizes of the essential and non-essential
column families are.

On Mon, Apr 8, 2013 at 1:34 PM, Sergey Shelukhin <se...@hortonworks.com> wrote:

> IntegrationTestLazyCfLoading uses randomly distributed keys with the
> following condition for filtering:
> 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
> is hex string of MD5 key.
> Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
> This test also showed significant improvement IIRC, so random distribution
> and a high percentage of values selected should not be a problem as such.
>
> My hunch would be that the additional cost of seeks/merging the results
> from two CFs outweighs the benefit of lazy loading on such small values
> for the "lazy" CF with lots of data selected. This feature definitely makes
> no sense if you are selecting all values, because then extra work is being
> done for no benefit (everything is read anyway).
> So the use cases would be larger "lazy" CFs and/or a low percentage of values
> selected.
>
> Can you try to increase the 2nd CF values' size and rerun the test?

Re: Essential column family performance

Posted by lars hofhansl <la...@apache.org>.
One of James' motivations was to always be able to enable scanners to make use of essential column families (and thus avoid HBase API version issues, since essential column families were added only in 0.94.5+).
Sounds like the general answer to this is: "No, you shouldn't. It should still be a per-query option, or at least a per-table option."
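
As a rough sketch of what a per-query decision could look like: the helper below is hypothetical (the class name, method name and both thresholds are assumptions for illustration, not part of HBase). The thresholds loosely follow the numbers reported in this thread: a gain only below about 10% selectivity with values under 50 bytes, and a gain at higher selectivity when the lazily loaded values are in the 4-64k range. The returned flag is what a client would, as far as I know, pass to Scan#setLoadColumnFamiliesOnDemand(boolean) for that one query.

```java
class LazyCfHeuristic {
    // Hypothetical cut-offs taken from the figures discussed in the thread,
    // not HBase defaults. They would need tuning against real measurements.
    static final double LOW_SELECTIVITY = 0.10;
    static final int LARGE_VALUE_BYTES = 4 * 1024;

    /**
     * Decide per query whether on-demand (lazy) loading of the
     * non-essential column family is likely to pay off.
     *
     * @param expectedSelectivity estimated fraction of rows passing the filter (0.0 to 1.0)
     * @param lazyValueBytes      typical value size in the non-essential family
     */
    static boolean shouldLoadOnDemand(double expectedSelectivity, int lazyValueBytes) {
        if (expectedSelectivity >= 1.0) {
            // Everything is read anyway, so the extra seeks are pure overhead.
            return false;
        }
        return expectedSelectivity < LOW_SELECTIVITY
                || lazyValueBytes >= LARGE_VALUE_BYTES;
    }

    public static void main(String[] args) {
        // Narrow values, 40% of rows selected: the case that regressed in James' test.
        System.out.println(shouldLoadOnDemand(0.40, 50));
        // Narrow values, 5% of rows selected: the case that improved.
        System.out.println(shouldLoadOnDemand(0.05, 50));
    }
}
```

A per-table variant would apply the same rule once, using typical selectivity and value sizes for the table instead of per-query estimates.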


-- Lars



________________________________
 From: Sergey Shelukhin <se...@hortonworks.com>
To: user@hbase.apache.org 
Sent: Monday, April 8, 2013 1:34 PM
Subject: Re: Essential column family performance
 
IntegrationTestLazyCfLoading uses randomly distributed keys with the
following condition for filtering:
1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
is hex string of MD5 key.
Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
This test also showed significant improvement IIRC, so random distribution
and a high percentage of values selected should not be a problem as such.

My hunch would be that the additional cost of seeks/merging the results
from two CFs outweighs the benefit of lazy loading on such small values
for the "lazy" CF with lots of data selected. This feature definitely makes
no sense if you are selecting all values, because then extra work is being
done for no benefit (everything is read anyway).
So the use cases would be larger "lazy" CFs and/or a low percentage of values
selected.

Can you try to increase the 2nd CF values' size and rerun the test?
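
To make the filtering condition quoted above concrete, here is a small standalone sketch. It uses plain String handling in place of HBase's Bytes.toString(rowKey, 0, 4), and the "row-N" inputs are made-up keys for illustration only; they are not what IntegrationTestLazyCfLoading generates. The condition keeps a row when the number formed by the first four hex characters of its MD5 key is odd, so about half of randomly distributed keys are selected.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5KeyFilterDemo {
    // Hex-encoded MD5 of the input, standing in for the test's row keys.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Same predicate as in the message: low bit of the first four hex digits.
    static boolean selected(String hexRowKey) {
        return 1 == (Long.parseLong(hexRowKey.substring(0, 4), 16) & 1);
    }

    // Count how many of n hypothetical keys pass the predicate.
    static int countSelected(int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (selected(md5Hex("row-" + i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 10000;
        System.out.println(countSelected(n) + " of " + n + " keys selected");
    }
}
```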


On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <jt...@salesforce.com> wrote:

> In the TestJoinedScanners.java, is the 40% randomly distributed or
> sequential?
>
> In our test, the % is randomly distributed. Also, our custom filter does
> the same thing that SingleColumnValueFilter does.  On the client-side, we'd
> execute the query in parallel, through multiple scans along the region
> boundaries. Would that have a negative impact on performance for this
> "essential column family" feature?
>
> Thanks,
>
>     James
>
>
> On 04/08/2013 10:10 AM, Anoop John wrote:
>
>> Agree here. The effectiveness depends on what % of data satisfies the
>> condition and how it is distributed across HFile blocks. We will get a
>> performance gain when we are able to skip some HFile blocks (from
>> non essential CFs). Can you test with a different HFile block size (lower
>> value)?
>>
>> -Anoop-
>>
>>
>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> I made the following change in TestJoinedScanners.java:
>>>
>>> -      int flag_percent = 1;
>>> +      int flag_percent = 40;
>>>
>>> The test took longer but still favors joined scanner.
>>> I got some new results:
>>>
>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>
>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>
>>> Looks like effectiveness of joined scanner is affected by distribution of
>>> data.
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <la...@apache.org> wrote:
>>>
>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>> rows match, which would somewhat be in line with James' results.
>>>>
>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>
>>>>
>>>> -- Lars
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>   From: Ted Yu <yu...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>> Subject: Re: Essential column family performance
>>>>
>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>> reference.
>>>>
>>>> On my MacBook, I got the following results from the test:
>>>>
>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>> ...
>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> Looking at
>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>> difference in scanner performance:
>>>>>
>>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
>>>>>         + Double.toString(timeSec)
>>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>
>>>>> The test uses SingleColumnValueFilter:
>>>>>
>>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>
>>>>> It is possible that the custom filter you were using would exhibit
>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>>> your filter utilize hint ?
>>>>> It would be easier for me and other people to reproduce the issue you
>>>>> experienced if you put your scenario in some test similar to
>>>>> TestJoinedScanners.
>>>>>
>>>>> Will take a closer look at the code Monday.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com> wrote:
>>>>>
>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>
>>>>>> I can see that if the essential column family has more data compared to
>>>>>> the non essential column family that the results would eventually even out.
>>>>>> I was hoping to always be able to enable the essential column family
>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>
>>>>>>> James:
>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>
>>>>>>> What Filter were you using ?
>>>>>>>
>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>
>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>> family carrying considerably more data compared to essential column family.
>>>>>>>
>>>>>>> Cheers

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
Using 30% selection rate, random distribution and FAST_DIFF encoding on
both column families, I got:

2013-04-08 19:46:21,802 INFO  [main] regionserver.TestJoinedScanners(166):
Slow scanner finished in 5.251182 seconds, got 1547 rows
...
2013-04-08 19:46:26,661 INFO  [main] regionserver.TestJoinedScanners(166):
Joined scanner finished in 4.858834 seconds, got 1547 rows

2013-04-08 19:46:31,891 INFO  [main] regionserver.TestJoinedScanners(166):
Slow scanner finished in 5.22988 seconds, got 1547 rows
...
2013-04-08 19:46:36,566 INFO  [main] regionserver.TestJoinedScanners(166):
Joined scanner finished in 4.674822 seconds, got 1547 rows
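
For reference, the timings above work out to a joined-scanner speedup of about 1.08x and 1.12x for the two runs. A minimal sketch of that arithmetic follows; the class and method names are just illustrative, and the numbers are copied from the log lines above.

```java
class ScannerSpeedup {
    // Ratio of slow-scanner time to joined-scanner time; above 1.0 means
    // the joined scanner was faster.
    static double speedup(double slowSeconds, double joinedSeconds) {
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // {slow, joined} pairs from the two runs at 30% selection rate.
        double[][] runs = {
            {5.251182, 4.858834},
            {5.22988, 4.674822},
        };
        for (double[] run : runs) {
            System.out.printf("%.2fx%n", speedup(run[0], run[1]));
        }
    }
}
```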

Cheers

On Mon, Apr 8, 2013 at 6:53 PM, James Taylor <jt...@salesforce.com> wrote:

> Good idea, Sergey. We'll rerun with larger non essential column family
> values and see if there's a crossover point. One other difference for us is
> that we're using FAST_DIFF encoding. We'll try with no encoding too. Our
> table has 20 million rows across four region servers.
>
> Regarding the parallelization we do, we run multiple scans in parallel
> instead of one single scan over the table. We use the region boundaries of
> the table to divide up the work evenly, adding a start/stop key for each
> scan that corresponds to the region boundaries. Our client then does a
> final merge/aggregation step (i.e. adding up the count it gets back from
> the scan for each region).

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
I tried using reseek() as suggested, along with my patch from HBASE-8306 (30%
selection rate, random distribution and FAST_DIFF encoding on both column
families).
I got uneven results:

2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167): Slow scanner finished in 7.529083 seconds, got 1546 rows

2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167): Joined scanner finished in 5.43579 seconds, got 1546 rows
...
2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167): Slow scanner finished in 5.95016 seconds, got 1546 rows

2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167): Joined scanner finished in 7.529044 seconds, got 1546 rows

FYI

On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <la...@apache.org> wrote:

> We did some tests here.
> I ran this through the profiler against a local RegionServer and found the
> part that causes the slowdown is a seek called here:
>              boolean mayHaveData =
>               (nextJoinedKv != null && nextJoinedKv.matchingRow(currentRow, offset, length))
>               || (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
>                   && joinedHeap.peek() != null
>                   && joinedHeap.peek().matchingRow(currentRow, offset, length));
>
> Looking at the code, this is needed because the joinedHeap can fall
> behind, and hence we have to catch it up.
> The key observation, though, is that the joined heap can only ever be
> behind, and hence we do not need a seek, but only a reseek.
>
> Deploying a RegionServer with the seek replaced with reseek we see an
> improvement in *all* cases.
>
> I'll file a jira with a fix later.
>
> -- Lars

Re: Essential column family performance

Posted by lars hofhansl <la...@apache.org>.
We did some tests here.
I ran this through the profiler against a local RegionServer and found the part that causes the slowdown is a seek called here:
             boolean mayHaveData =
              (nextJoinedKv != null && nextJoinedKv.matchingRow(currentRow, offset, length))
              || (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
                  && joinedHeap.peek() != null
                  && joinedHeap.peek().matchingRow(currentRow, offset, length));

Looking at the code, this is needed because the joinedHeap can fall behind, and hence we have to catch it up.
The key observation, though, is that the joined heap can only ever be behind, and hence we do not need a seek, but only a reseek.
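
To make the distinction concrete, here is a toy sketch. It is not the HBase scanner code: the class, its methods, and the comparison counter are made up for illustration, and it only models a scanner over a sorted array of row keys. The assumption it illustrates is the one stated above: if the target is never behind the current position, the scanner can move forward from where it is instead of searching from scratch.

```java
/** Toy model of a scanner over sorted row keys. Not HBase code; all names are hypothetical. */
final class ReseekSketch {
    private final long[] rows; // sorted row keys
    private int pos = 0;       // index of the current row
    long steps = 0;            // key comparisons performed, a rough cost proxy

    ReseekSketch(long[] sortedRows) {
        this.rows = sortedRows;
    }

    /** seek: the target may be anywhere, so search the whole array from scratch. */
    boolean seek(long target) {
        int lo = 0, hi = rows.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            steps++;
            if (rows[mid] < target) lo = mid + 1; else hi = mid;
        }
        pos = lo;
        return pos < rows.length;
    }

    /** reseek: the caller guarantees target >= current row, so only walk forward. */
    boolean reseek(long target) {
        while (pos < rows.length) {
            steps++;
            if (rows[pos] >= target) break;
            pos++;
        }
        return pos < rows.length;
    }

    /** Current row key, or null when the scanner is exhausted. */
    Long peek() {
        return pos < rows.length ? Long.valueOf(rows[pos]) : null;
    }

    public static void main(String[] args) {
        long[] rows = new long[1 << 16];
        for (int i = 0; i < rows.length; i++) rows[i] = i;

        ReseekSketch seeker = new ReseekSketch(rows);
        ReseekSketch reseeker = new ReseekSketch(rows);
        // Catch the lagging scanner up to every other row, always moving forward.
        for (long target = 0; target < rows.length; target += 2) {
            seeker.seek(target);
            reseeker.reseek(target);
        }
        System.out.println("seek comparisons:   " + seeker.steps);
        System.out.println("reseek comparisons: " + reseeker.steps);
    }
}
```

In this toy the forward walk passes each row at most once over the whole run, while every full seek pays for a fresh search. The real cost difference in a RegionServer depends on store files, blocks and encoding, so treat the numbers as an illustration only.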

Deploying a RegionServer with the seek replaced by a reseek, we see an improvement in *all* cases.

I'll file a JIRA with a fix later.

-- Lars



________________________________
 From: James Taylor <jt...@salesforce.com>
To: user@hbase.apache.org 
Sent: Monday, April 8, 2013 6:53 PM
Subject: Re: Essential column family performance
 
Good idea, Sergey. We'll rerun with larger non essential column family 
values and see if there's a crossover point. One other difference for us 
is that we're using FAST_DIFF encoding. We'll try with no encoding too. 
Our table has 20 million rows across four region servers.

Regarding the parallelization we do, we run multiple scans in parallel 
instead of one single scan over the table. We use the region boundaries 
of the table to divide up the work evenly, adding a start/stop key for 
each scan that corresponds to the region boundaries. Our client then 
does a final merge/aggregation step (i.e. adding up the count it gets 
back from the scan for each region).


Re: Essential column family performance

Posted by James Taylor <jt...@salesforce.com>.
Good idea, Sergey. We'll rerun with larger non essential column family 
values and see if there's a crossover point. One other difference for us 
is that we're using FAST_DIFF encoding. We'll try with no encoding too. 
Our table has 20 million rows across four region servers.

Regarding the parallelization we do, we run multiple scans in parallel 
instead of one single scan over the table. We use the region boundaries 
of the table to divide up the work evenly, adding a start/stop key for 
each scan that corresponds to the region boundaries. Our client then 
does a final merge/aggregation step (i.e. adding up the count it gets 
back from the scan for each region).
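
A minimal sketch of that pattern, with loud caveats: it does not use the HBase client API at all. A sorted in-memory map stands in for the table, a list of start keys stands in for the region boundaries, and `scanRange` and `parallelCount` are hypothetical names. It only shows the shape of the work: one range scan per region, run in parallel, followed by a client-side sum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model of per-region parallel scans; not HBase client code. */
final class ParallelRegionCount {

    /** Counts matching rows in [startKey, stopKey); stands in for one filtered scan. */
    static long scanRange(NavigableMap<String, Boolean> table, String startKey, String stopKey) {
        NavigableMap<String, Boolean> range = (stopKey == null)
                ? table.tailMap(startKey, true)                 // last region: open-ended
                : table.subMap(startKey, true, stopKey, false); // start inclusive, stop exclusive
        long count = 0;
        for (boolean matches : range.values()) {
            if (matches) count++;
        }
        return count;
    }

    /** Runs one scan per region in parallel and adds up the per-region counts. */
    static long parallelCount(NavigableMap<String, Boolean> table, List<String> regionStartKeys) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regionStartKeys.size()));
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < regionStartKeys.size(); i++) {
                final String start = regionStartKeys.get(i);
                final String stop = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : null;
                Callable<Long> task = () -> scanRange(table, start, stop);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // final merge/aggregation step on the client
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Boolean> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("%04d", i), i % 10 == 0); // 10% of rows pass the filter
        }
        // Four "regions"; the first start key is empty, as for the first region of a table.
        List<String> regionStartKeys = Arrays.asList("", "0250", "0500", "0750");
        System.out.println("matching rows: " + parallelCount(table, regionStartKeys));
    }
}
```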

On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
> IntegrationTestLazyCfLoading uses randomly distributed keys with the
> following condition for filtering:
> 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
> is hex string of MD5 key.
> Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
> This test also showed significant improvement IIRC, so random distribution
> and a high percentage of values selected should not be a problem as such.
>
> My hunch would be that the additional cost of seeks/merging the results
> from two CFs outweighs the benefit of lazy loading on such small values
> for the "lazy" CF with lots of data selected. This feature definitely makes
> no sense if you are selecting all values, because then extra work is being
> done for no benefit (everything is read anyway).
> So the use cases would be larger "lazy" CFs and/or a low percentage of values
> selected.
>
> Can you try to increase the 2nd CF values' size and rerun the test?


Re: Essential column family performance

Posted by Sergey Shelukhin <se...@hortonworks.com>.
IntegrationTestLazyCfLoading uses randomly distributed keys with the
following condition for filtering:
1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
is the hex string of an MD5 key.
Then, there are 2 "lazy" CFs, each of which has values of 4-64k.
This test also showed significant improvement IIRC, so random distribution
and a high percentage of values selected should not be a problem as such.
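
To make the filtering condition above concrete, here is a minimal standalone
sketch. It is not the code from IntegrationTestLazyCfLoading: the class name
ParityRowSelector, the "row-N" inputs and the helper md5Hex are made up for
illustration, and plain java.lang.String stands in for HBase's Bytes.toString
so that the snippet runs without HBase on the classpath. The condition parses
the first four hex characters of the key and keeps the row when the result is
odd, so with MD5-derived keys it should select roughly half of the rows.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only (not part of HBase).
final class ParityRowSelector {

    // Hex-encode the MD5 digest of the input, similar in spirit to an MD5-based row key.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Same shape as the quoted condition; new String(...) replaces Bytes.toString(rowKey, 0, 4).
    static boolean isSelected(byte[] rowKey) {
        String prefix = new String(rowKey, 0, 4, StandardCharsets.US_ASCII);
        return 1 == (Long.parseLong(prefix, 16) & 1);
    }

    public static void main(String[] args) {
        int total = 10000;
        int selected = 0;
        for (int i = 0; i < total; i++) {
            byte[] rowKey = md5Hex("row-" + i).getBytes(StandardCharsets.US_ASCII);
            if (isSelected(rowKey)) {
                selected++;
            }
        }
        // The selected rows are scattered across the key space rather than clustered.
        System.out.println("selected " + selected + " of " + total + " rows");
    }
}
```

Only the low bit of the fourth hex digit decides the outcome, which is why the
selection rate sits near 50% and the matching rows are spread evenly.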

My hunch would be that the additional cost of seeks/merging the results
from two CFs outweighs the benefit of lazy loading on such small values
for the "lazy" CF when lots of data is selected. This feature definitely makes
no sense if you are selecting all values, because then extra work is being
done for no benefit (everything is read anyway).
So the use cases would be larger "lazy" CFs and/or a low percentage of values
selected.
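
The trade-off in this hunch can be sketched with a toy cost model. This is only
an illustration under made-up assumptions, not a description of how HBase
accounts for work: every row pays a fixed cost to read the essential CF, the
eager scan also pays a read cost for the lazy CF on every row, and the lazy
scan pays that read cost plus an extra seek/merge cost only for the selected
rows. The class name LazyCfCostModel and all cost values are hypothetical.

```java
// Toy model only; the unit costs are arbitrary and not HBase measurements.
final class LazyCfCostModel {

    // Scan time with lazy loading divided by scan time without it (below 1.0 means faster).
    static double relativeScanTime(double selectivity, double essentialCost,
                                   double lazyReadCost, double lazySeekCost) {
        double eager = essentialCost + lazyReadCost;
        double lazy = essentialCost + selectivity * (lazyReadCost + lazySeekCost);
        return lazy / eager;
    }

    // Fraction of selected rows at which both approaches cost the same in this model.
    static double breakEvenSelectivity(double lazyReadCost, double lazySeekCost) {
        return lazyReadCost / (lazyReadCost + lazySeekCost);
    }

    public static void main(String[] args) {
        // Small lazy values: the per-row seek overhead dominates the read cost.
        System.out.println("small values, break-even at "
            + breakEvenSelectivity(1.0, 9.0));
        // Large lazy values: the read cost dominates, so lazy loading pays off more often.
        System.out.println("large values, break-even at "
            + breakEvenSelectivity(90.0, 9.0));
        for (double p : new double[] {0.0, 0.05, 0.2, 0.6, 1.0}) {
            System.out.println("selectivity " + p + " -> relative time "
                + relativeScanTime(p, 1.0, 1.0, 9.0));
        }
    }
}
```

In this model the break-even selectivity is lazyReadCost / (lazyReadCost +
lazySeekCost), so growing the values in the "lazy" CF pushes the break-even
point towards higher percentages of selected rows.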

Can you try to increase the 2nd CF values' size and rerun the test?


On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <jt...@salesforce.com>wrote:

> In the TestJoinedScanners.java, is the 40% randomly distributed or
> sequential?
>
> In our test, the % is randomly distributed. Also, our custom filter does
> the same thing that SingleColumnValueFilter does.  On the client-side, we'd
> execute the query in parallel, through multiple scans along the region
> boundaries. Would that have a negative impact on performance for this
> "essential column family" feature?
>
> Thanks,
>
>     James

Re: Essential column family performance

Posted by ramkrishna vasudevan <ra...@gmail.com>.
bq. through multiple scans along the region boundaries
Sorry, I am not able to follow what you are saying. Could you elaborate on this?
I think the validity of this essential CF feature is best tested in real
use cases such as those in Phoenix.

Regards
Ram


On Mon, Apr 8, 2013 at 11:12 PM, Ted Yu <yu...@gmail.com> wrote:

> bq. is the 40% randomly distributed or sequential?
> Looks like the distribution is striped:
>
>         if (i % 100 <= flag_percent) {
>
>           put.add(cf_essential, col_name, flag_yes);
> In each stripe, it is sequential.
>
> Let me try simulating random distribution.
>
> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <jtaylor@salesforce.com
> >wrote:
>
> > In the TestJoinedScanners.java, is the 40% randomly distributed or
> > sequential?
> >
> > In our test, the % is randomly distributed. Also, our custom filter does
> > the same thing that SingleColumnValueFilter does.  On the client-side,
> we'd
> > execute the query in parallel, through multiple scans along the region
> > boundaries. Would that have a negative impact on performance for this
> > "essential column family" feature?
> >
> > Thanks,
> >
> >     James

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
I switched the test to a random distribution, with 30% of the rows selected.
I still saw a meaningful improvement from the joined scanner (roughly 1.25x
and 1.12x in the two runs below):

2013-04-08 10:54:13,819 INFO  [main] regionserver.TestJoinedScanners(158):
Slow scanner finished in 6.20723 seconds, got 1552 rows
...
2013-04-08 10:54:18,801 INFO  [main] regionserver.TestJoinedScanners(158):
Joined scanner finished in 4.982732 seconds, got 1552 rows

2013-04-08 10:54:23,997 INFO  [main] regionserver.TestJoinedScanners(158):
Slow scanner finished in 5.195658 seconds, got 1552 rows
...
2013-04-08 10:54:28,619 INFO  [main] regionserver.TestJoinedScanners(158):
Joined scanner finished in 4.621337 seconds, got 1552 rows

Cheers

On Mon, Apr 8, 2013 at 10:42 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. is the 40% randomly distributed or sequential?
> Looks like the distribution is striped:
>
>         if (i % 100 <= flag_percent) {
>
>           put.add(cf_essential, col_name, flag_yes);
> In each stripe, it is sequential.
>
> Let me try simulating random distribution.

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
bq. is the 40% randomly distributed or sequential?
Looks like the distribution is striped:

        if (i % 100 <= flag_percent) {
          put.add(cf_essential, col_name, flag_yes);

In each stripe, it is sequential.

Let me try simulating random distribution.
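
For comparison, the striped layout and one possible random layout can be
sketched without HBase. This is not the TestJoinedScanners code: the class
FlagDistribution, its method names, the row count and the seed are
hypothetical. Note that with the `<=` comparison quoted above, the striped
variant flags flag_percent + 1 rows out of every 100 (41 rows per stripe for
flag_percent = 40).

```java
import java.util.Random;

// Hypothetical illustration of striped versus random flag placement.
final class FlagDistribution {

    // Same condition as the quoted snippet: the first rows of each block of 100 are flagged.
    static boolean[] striped(int rows, int flagPercent) {
        boolean[] flagged = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flagged[i] = (i % 100 <= flagPercent);
        }
        return flagged;
    }

    // Each row is flagged independently with probability flagPercent / 100.
    static boolean[] random(int rows, int flagPercent, long seed) {
        Random rnd = new Random(seed);
        boolean[] flagged = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flagged[i] = rnd.nextInt(100) < flagPercent;
        }
        return flagged;
    }

    static int count(boolean[] flagged) {
        int n = 0;
        for (boolean f : flagged) {
            if (f) {
                n++;
            }
        }
        return n;
    }

    // Length of the longest run of consecutive flagged rows.
    static int longestRun(boolean[] flagged) {
        int best = 0;
        int current = 0;
        for (boolean f : flagged) {
            current = f ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    public static void main(String[] args) {
        boolean[] s = striped(10000, 40);
        boolean[] r = random(10000, 40, 42L);
        System.out.println("striped: " + count(s) + " flagged, longest run " + longestRun(s));
        System.out.println("random:  " + count(r) + " flagged, longest run " + longestRun(r));
    }
}
```

Long runs of matching and non-matching rows make it more likely that whole
blocks of the non-essential CF can be skipped, which is one plausible reason
the two layouts could behave differently.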

On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <jt...@salesforce.com>wrote:

> In the TestJoinedScanners.java, is the 40% randomly distributed or
> sequential?
>
> In our test, the % is randomly distributed. Also, our custom filter does
> the same thing that SingleColumnValueFilter does.  On the client-side, we'd
> execute the query in parallel, through multiple scans along the region
> boundaries. Would that have a negative impact on performance for this
> "essential column family" feature?
>
> Thanks,
>
>     James
>
>
> On 04/08/2013 10:10 AM, Anoop John wrote:
>
>> Agree here. The effectiveness depends on what % of the data satisfies the
>> condition and how it is distributed across HFile blocks. We will get a
>> performance gain when we are able to skip some HFile blocks (from
>> non essential CFs). Can you test with a different (lower) HFile block
>> size?
>>
>> -Anoop-
>>
>>
>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>  I made the following change in TestJoinedScanners.java:
>>>
>>> -      int flag_percent = 1;
>>> +      int flag_percent = 40;
>>>
>>> The test took longer but still favors joined scanner.
>>> I got some new results:
>>>
>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>
>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>
>>> Looks like effectiveness of joined scanner is affected by distribution of
>>> data.
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <la...@apache.org> wrote:
>>>
>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>> rows match, which would somewhat be in line with James' results.
>>>>
>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>
>>>>
>>>> -- Lars
>>>>
>>>>
>>>>
>>>> ______________________________**__
>>>>   From: Ted Yu <yu...@gmail.com>
>>>> To: user@hbase.apache.org
>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>> Subject: Re: Essential column family performance
>>>>
>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>> reference.
>>>>
>>>> On my MacBook, I got the following results from the test:
>>>>
>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>> ...
>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> Looking at
>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>> difference in scanner performance:
>>>>>
>>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>         Double.toString(timeSec)
>>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>
>>>>> The test uses SingleColumnValueFilter:
>>>>>
>>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>
>>>>> It is possible that the custom filter you were using would exhibit
>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>>> your filter utilize hint ?
>>>>> It would be easier for me and other people to reproduce the issue you
>>>>> experienced if you put your scenario in some test similar to
>>>>> TestJoinedScanners.
>>>>>
>>>>> Will take a closer look at the code Monday.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com
>>>>> wrote:
>>>>>
>>>>>  Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
>>>>>>
>>>>> so
>>>>
>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>
>>>>>> I can see that if the essential column family has more data compared
>>>>>>
>>>>> to
>>>
>>>> the non essential column family that the results would eventually even
>>>>>>
>>>>> out.
>>>>
>>>>> I was hoping to always be able to enable the essential column family
>>>>>> feature. Is there an inherent reason why performance would degrade
>>>>>>
>>>>> like
>>>
>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>
>>>>>>  James:
>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>
>>>>>>> What Filter were you using ?
>>>>>>>
>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>
>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>> family
>>>>>>> carrying considerably more data compared to essential column family.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <
>>>>>>>
>>>>>> jtaylor@salesforce.com
>>>
>>>> wrote:
>>>>>>>>
>>>>>>>   Hello,
>>>>>>>
>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>> feature, and we're seeing some performance degradation when
>>>>>>>>
>>>>>>> comparing
>>>
>>>> with
>>>>>>>> and without the feature enabled:
>>>>>>>>
>>>>>>>>                             Performance of scan relative
>>>>>>>> % of rows selected        to not enabling the feature
>>>>>>>> ---------------------    --------------------------------
>>>>>>>>
>>>>>>>> 100%                            1.0x
>>>>>>>>    80%                            2.0x
>>>>>>>>    60%                            2.3x
>>>>>>>>    40%                            2.2x
>>>>>>>>    20%                            1.5x
>>>>>>>>    10%                            1.0x
>>>>>>>>     5%                            0.67x
>>>>>>>>     0%                            0.30%
>>>>>>>>
>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>> essential column family is used in the filter, while the key value
>>>>>>>>
>>>>>>> from
>>>>
>>>>> the
>>>>>>>> other, non essential column family is returned by the scan. Each row
>>>>>>>> contains values for both key values, with the values being
>>>>>>>>
>>>>>>> relatively
>>>
>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>>> seeing a
>>>>>>>> performance gain is when less than 10% of the rows are selected.
>>>>>>>>
>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>

Re: Essential column family performance

Posted by James Taylor <jt...@salesforce.com>.
In TestJoinedScanners.java, are the 40% of matching rows randomly
distributed or sequential?

In our test, the matching rows are randomly distributed. Also, our custom
filter does the same thing that SingleColumnValueFilter does. On the client
side, we'd execute the query in parallel, through multiple scans along the
region boundaries. Would that have a negative impact on performance for
this "essential column family" feature?

Thanks,

     James

On 04/08/2013 10:10 AM, Anoop John wrote:
> Agree here. The effectiveness depends on what % of data satisfies the
> condition, how it is distributed across HFile blocks. We will get
> performance gain when the we will be able to skip some HFile blocks (from
> non essential CFs). Can test with different HFile block size (lower value)?
>
> -Anoop-
>
>
> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> I made the following change in TestJoinedScanners.java:
>>
>> -      int flag_percent = 1;
>> +      int flag_percent = 40;
>>
>> The test took longer but still favors joined scanner.
>> I got some new results:
>>
>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>
>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>
>> Looks like effectiveness of joined scanner is affected by distribution of
>> data.
>>
>> Cheers
>>
>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <la...@apache.org> wrote:
>>
>>> Looking at the joined scanner test code, it sets it up such that 1% of
>> the
>>> rows match, which would somewhat be in line with James' results.
>>>
>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>   From: Ted Yu <yu...@gmail.com>
>>> To: user@hbase.apache.org
>>> Sent: Sunday, April 7, 2013 4:13 PM
>>> Subject: Re: Essential column family performance
>>>
>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>> reference.
>>>
>>> On my MacBook, I got the following results from the test:
>>>
>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>> ...
>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> Looking at
>>>>
>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt
>> ,
>>> I found that it didn't contain TestJoinedScanners which shows
>>>> difference in scanner performance:
>>>>
>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>> Double.toString(timeSec)
>>>>
>>>>        + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>
>>>> The test uses SingleColumnValueFilter:
>>>>
>>>>      SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>
>>>>          cf_essential, col_name, CompareFilter.CompareOp.EQUAL,
>> flag_yes);
>>>> It is possible that the custom filter you were using would exhibit
>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>> your filter utilize hint ?
>>>> It would be easier for me and other people to reproduce the issue you
>>>> experienced if you put your scenario in some test similar to
>>>> TestJoinedScanners.
>>>>
>>>> Will take a closer look at the code Monday.
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com
>>>> wrote:
>>>>
>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
>>> so
>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>
>>>>> I can see that if the essential column family has more data compared
>> to
>>>>> the non essential column family that the results would eventually even
>>> out.
>>>>> I was hoping to always be able to enable the essential column family
>>>>> feature. Is there an inherent reason why performance would degrade
>> like
>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>
>>>>>> James:
>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>
>>>>>> What Filter were you using ?
>>>>>>
>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>> family
>>>>>> carrying considerably more data compared to essential column family.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <
>> jtaylor@salesforce.com
>>>>>>> wrote:
>>>>>>   Hello,
>>>>>>> We're doing some performance testing of the essential column family
>>>>>>> feature, and we're seeing some performance degradation when
>> comparing
>>>>>>> with
>>>>>>> and without the feature enabled:
>>>>>>>
>>>>>>>                             Performance of scan relative
>>>>>>> % of rows selected        to not enabling the feature
>>>>>>> ---------------------    --------------------------------
>>>>>>>
>>>>>>> 100%                            1.0x
>>>>>>>    80%                            2.0x
>>>>>>>    60%                            2.3x
>>>>>>>    40%                            2.2x
>>>>>>>    20%                            1.5x
>>>>>>>    10%                            1.0x
>>>>>>>     5%                            0.67x
>>>>>>>     0%                            0.30%
>>>>>>>
>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>> essential column family is used in the filter, while the key value
>>> from
>>>>>>> the
>>>>>>> other, non essential column family is returned by the scan. Each row
>>>>>>> contains values for both key values, with the values being
>> relatively
>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>> seeing a
>>>>>>> performance gain is when less than 10% of the rows are selected.
>>>>>>>
>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>


Re: Essential column family performance

Posted by Anoop John <an...@gmail.com>.
Agree here. The effectiveness depends on what % of the data satisfies the
condition and how it is distributed across HFile blocks. We will get a
performance gain when we are able to skip some HFile blocks (from the
non essential CFs). Can you test with a different (lower) HFile block size?
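
A back-of-the-envelope sketch of why the distribution matters, under a simplified model that is not taken from the HBase code: if each row matches independently with probability p and a non essential CF block holds n rows, the chance that the whole block can be skipped is (1 - p)^n. The 100 bytes per row figure and the class and method names below are assumptions for illustration only.

```java
public class BlockSkipEstimate {

    /**
     * Probability that none of the rows in a block match the filter,
     * assuming every row matches independently with probability matchFraction.
     */
    public static double skipProbability(double matchFraction, int rowsPerBlock) {
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        int bytesPerRow = 100;                      // hypothetical on-disk size of one row's KeyValue
        int[] blockSizes = {64 * 1024, 8 * 1024};   // 64 KB is the default HFile block size
        double[] matchFractions = {0.01, 0.05, 0.10, 0.40};

        for (int blockSize : blockSizes) {
            int rowsPerBlock = blockSize / bytesPerRow;
            for (double fraction : matchFractions) {
                System.out.printf("block=%d bytes, match=%.0f%%, skippable blocks=%.4f%n",
                    blockSize, fraction * 100, skipProbability(fraction, rowsPerBlock));
            }
        }
    }
}
```

Under these assumptions, with randomly distributed matches at 1% almost no 64 KB block can be skipped (roughly 0.1%), while with 8 KB blocks roughly 44% can, which is why a lower block size is worth testing.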

-Anoop-


On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <yu...@gmail.com> wrote:

> I made the following change in TestJoinedScanners.java:
>
> -      int flag_percent = 1;
> +      int flag_percent = 40;
>
> The test took longer but still favors joined scanner.
> I got some new results:
>
> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
> Slow scanner finished in 7.424388 seconds, got 2050 rows
> ...
> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
> Joined scanner finished in 5.05063 seconds, got 2050 rows
>
> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
> Slow scanner finished in 6.348517 seconds, got 2050 rows
> ...
> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
> Joined scanner finished in 4.587545 seconds, got 2050 rows
>
> Looks like effectiveness of joined scanner is affected by distribution of
> data.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Looking at the joined scanner test code, it sets it up such that 1% of
> the
> > rows match, which would somewhat be in line with James' results.
> >
> > In my own testing a while ago I found a 100% improvement with 0% match.
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Ted Yu <yu...@gmail.com>
> > To: user@hbase.apache.org
> > Sent: Sunday, April 7, 2013 4:13 PM
> > Subject: Re: Essential column family performance
> >
> > I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
> > reference.
> >
> > On my MacBook, I got the following results from the test:
> >
> > 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
> > Slow scanner finished in 7.973822 seconds, got 100 rows
> > ...
> > 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
> > Joined scanner finished in 0.47235 seconds, got 100 rows
> >
> > Cheers
> >
> > On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Looking at
> > >
> >
> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt
> ,
> > I found that it didn't contain TestJoinedScanners which shows
> > > difference in scanner performance:
> > >
> > >    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> > > Double.toString(timeSec)
> > >
> > >       + " seconds, got " + Long.toString(rows_count/2) + " rows");
> > >
> > > The test uses SingleColumnValueFilter:
> > >
> > >     SingleColumnValueFilter filter = new SingleColumnValueFilter(
> > >
> > >         cf_essential, col_name, CompareFilter.CompareOp.EQUAL,
> flag_yes);
> > > It is possible that the custom filter you were using would exhibit
> > > different access pattern compared to SingleColumnValueFilter. e.g. does
> > > your filter utilize hint ?
> > > It would be easier for me and other people to reproduce the issue you
> > > experienced if you put your scenario in some test similar to
> > > TestJoinedScanners.
> > >
> > > Will take a closer look at the code Monday.
> > >
> > > Cheers
> > >
> > > On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com
> > >wrote:
> > >
> > >> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
> > so
> > >> filterIfMissing isn't the issue - the results of the scan are correct.
> > >>
> > >> I can see that if the essential column family has more data compared
> to
> > >> the non essential column family that the results would eventually even
> > out.
> > >> I was hoping to always be able to enable the essential column family
> > >> feature. Is there an inherent reason why performance would degrade
> like
> > >> this? Does it boil down to a single sequential scan versus many seeks?
> > >>
> > >> Thanks,
> > >>
> > >> James
> > >>
> > >>
> > >> On 04/07/2013 07:44 AM, Ted Yu wrote:
> > >>
> > >>> James:
> > >>> Your test was based on 0.94.6.1, right ?
> > >>>
> > >>> What Filter were you using ?
> > >>>
> > >>> If you used SingleColumnValueFilter, have you seen my comment here ?
> > >>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> > >>>
> > >>> BTW the use case Max Lapan tried to address has non essential column
> > >>> family
> > >>> carrying considerably more data compared to essential column family.
> > >>>
> > >>> Cheers
> > >>>
> > >>>
> > >>>
> > >>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <
> jtaylor@salesforce.com
> > >>> >wrote:
> > >>>
> > >>>  Hello,
> > >>>> We're doing some performance testing of the essential column family
> > >>>> feature, and we're seeing some performance degradation when
> comparing
> > >>>> with
> > >>>> and without the feature enabled:
> > >>>>
> > >>>>                            Performance of scan relative
> > >>>> % of rows selected        to not enabling the feature
> > >>>> ---------------------    --------------------------------
> > >>>>
> > >>>> 100%                            1.0x
> > >>>>   80%                            2.0x
> > >>>>   60%                            2.3x
> > >>>>   40%                            2.2x
> > >>>>   20%                            1.5x
> > >>>>   10%                            1.0x
> > >>>>    5%                            0.67x
> > >>>>    0%                            0.30%
> > >>>>
> > >>>> In our scenario, we have two column families. The key value from the
> > >>>> essential column family is used in the filter, while the key value
> > from
> > >>>> the
> > >>>> other, non essential column family is returned by the scan. Each row
> > >>>> contains values for both key values, with the values being
> relatively
> > >>>> narrow (less than 50 bytes). In this scenario, the only time we're
> > >>>> seeing a
> > >>>> performance gain is when less than 10% of the rows are selected.
> > >>>>
> > >>>> Is this a reasonable test? Has anyone else measured this?
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> James
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>
> > >
> >
>

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
I made the following change in TestJoinedScanners.java:

-      int flag_percent = 1;
+      int flag_percent = 40;

The test took longer but still favors the joined scanner.
I got some new results:

2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.424388 seconds, got 2050 rows
...
2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 5.05063 seconds, got 2050 rows

2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 6.348517 seconds, got 2050 rows
...
2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 4.587545 seconds, got 2050 rows

It looks like the effectiveness of the joined scanner is affected by the
distribution of the data.
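
To put numbers on "still favors", here is a small sketch that turns the logged timings into speedup ratios. The timings are the ones from the test output in this thread; the class and method names are made up for illustration.

```java
public class ScannerSpeedup {

    /** Ratio of slow-scanner time to joined-scanner time; above 1.0 means the joined scanner won. */
    public static double speedup(double slowSeconds, double joinedSeconds) {
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // {slow seconds, joined seconds} as logged by TestJoinedScanners
        double[][] runs = {
            {7.973822, 0.47235},   // flag_percent = 1
            {7.424388, 5.05063},   // flag_percent = 40, first run
            {6.348517, 4.587545},  // flag_percent = 40, second run
        };
        String[] labels = {"1% match", "40% match (run 1)", "40% match (run 2)"};

        for (int i = 0; i < runs.length; i++) {
            System.out.printf("%-18s joined scanner is %.2fx faster%n",
                labels[i], speedup(runs[i][0], runs[i][1]));
        }
    }
}
```

That works out to roughly 16.9x at 1% and roughly 1.4x to 1.5x at 40%, so the advantage shrinks quickly as more rows match.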

Cheers

On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <la...@apache.org> wrote:

> Looking at the joined scanner test code, it sets it up such that 1% of the
> rows match, which would somewhat be in line with James' results.
>
> In my own testing a while ago I found a 100% improvement with 0% match.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Ted Yu <yu...@gmail.com>
> To: user@hbase.apache.org
> Sent: Sunday, April 7, 2013 4:13 PM
> Subject: Re: Essential column family performance
>
> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
> reference.
>
> On my MacBook, I got the following results from the test:
>
> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
> Slow scanner finished in 7.973822 seconds, got 100 rows
> ...
> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
> Joined scanner finished in 0.47235 seconds, got 100 rows
>
> Cheers
>
> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Looking at
> >
> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
> I found that it didn't contain TestJoinedScanners which shows
> > difference in scanner performance:
> >
> >    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> > Double.toString(timeSec)
> >
> >       + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >
> > The test uses SingleColumnValueFilter:
> >
> >     SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >
> >         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> > It is possible that the custom filter you were using would exhibit
> > different access pattern compared to SingleColumnValueFilter. e.g. does
> > your filter utilize hint ?
> > It would be easier for me and other people to reproduce the issue you
> > experienced if you put your scenario in some test similar to
> > TestJoinedScanners.
> >
> > Will take a closer look at the code Monday.
> >
> > Cheers
> >
> > On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jtaylor@salesforce.com
> >wrote:
> >
> >> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
> so
> >> filterIfMissing isn't the issue - the results of the scan are correct.
> >>
> >> I can see that if the essential column family has more data compared to
> >> the non essential column family that the results would eventually even
> out.
> >> I was hoping to always be able to enable the essential column family
> >> feature. Is there an inherent reason why performance would degrade like
> >> this? Does it boil down to a single sequential scan versus many seeks?
> >>
> >> Thanks,
> >>
> >> James
> >>
> >>
> >> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>
> >>> James:
> >>> Your test was based on 0.94.6.1, right ?
> >>>
> >>> What Filter were you using ?
> >>>
> >>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>>
> >>> BTW the use case Max Lapan tried to address has non essential column
> >>> family
> >>> carrying considerably more data compared to essential column family.
> >>>
> >>> Cheers
> >>>
> >>>
> >>>
> >>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com
> >>> >wrote:
> >>>
> >>>  Hello,
> >>>> We're doing some performance testing of the essential column family
> >>>> feature, and we're seeing some performance degradation when comparing
> >>>> with
> >>>> and without the feature enabled:
> >>>>
> >>>>                            Performance of scan relative
> >>>> % of rows selected        to not enabling the feature
> >>>> ---------------------    --------------------------------
> >>>>
> >>>> 100%                            1.0x
> >>>>   80%                            2.0x
> >>>>   60%                            2.3x
> >>>>   40%                            2.2x
> >>>>   20%                            1.5x
> >>>>   10%                            1.0x
> >>>>    5%                            0.67x
> >>>>    0%                            0.30%
> >>>>
> >>>> In our scenario, we have two column families. The key value from the
> >>>> essential column family is used in the filter, while the key value
> from
> >>>> the
> >>>> other, non essential column family is returned by the scan. Each row
> >>>> contains values for both key values, with the values being relatively
> >>>> narrow (less than 50 bytes). In this scenario, the only time we're
> >>>> seeing a
> >>>> performance gain is when less than 10% of the rows are selected.
> >>>>
> >>>> Is this a reasonable test? Has anyone else measured this?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> James
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >
>

Re: Essential column family performance

Posted by lars hofhansl <la...@apache.org>.
Looking at the joined scanner test code, it sets it up such that 1% of the rows match, which would somewhat be in line with James' results.

In my own testing a while ago I found a 100% improvement with 0% match.


-- Lars



________________________________
 From: Ted Yu <yu...@gmail.com>
To: user@hbase.apache.org 
Sent: Sunday, April 7, 2013 4:13 PM
Subject: Re: Essential column family performance
 
I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
reference.

On my MacBook, I got the following results from the test:

2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.973822 seconds, got 100 rows
...
2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 0.47235 seconds, got 100 rows

Cheers

On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:

> Looking at
> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt, I found that it didn't contain TestJoinedScanners which shows
> difference in scanner performance:
>
>    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> Double.toString(timeSec)
>
>       + " seconds, got " + Long.toString(rows_count/2) + " rows");
>
> The test uses SingleColumnValueFilter:
>
>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>
>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> It is possible that the custom filter you were using would exhibit
> different access pattern compared to SingleColumnValueFilter. e.g. does
> your filter utilize hint ?
> It would be easier for me and other people to reproduce the issue you
> experienced if you put your scenario in some test similar to
> TestJoinedScanners.
>
> Will take a closer look at the code Monday.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jt...@salesforce.com>wrote:
>
>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>> filterIfMissing isn't the issue - the results of the scan are correct.
>>
>> I can see that if the essential column family has more data compared to
>> the non essential column family that the results would eventually even out.
>> I was hoping to always be able to enable the essential column family
>> feature. Is there an inherent reason why performance would degrade like
>> this? Does it boil down to a single sequential scan versus many seeks?
>>
>> Thanks,
>>
>> James
>>
>>
>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>
>>> James:
>>> Your test was based on 0.94.6.1, right ?
>>>
>>> What Filter were you using ?
>>>
>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>
>>> BTW the use case Max Lapan tried to address has non essential column
>>> family
>>> carrying considerably more data compared to essential column family.
>>>
>>> Cheers
>>>
>>>
>>>
>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com
>>> >wrote:
>>>
>>>  Hello,
>>>> We're doing some performance testing of the essential column family
>>>> feature, and we're seeing some performance degradation when comparing
>>>> with
>>>> and without the feature enabled:
>>>>
>>>>                            Performance of scan relative
>>>> % of rows selected        to not enabling the feature
>>>> ---------------------    --------------------------------
>>>>
>>>> 100%                            1.0x
>>>>   80%                            2.0x
>>>>   60%                            2.3x
>>>>   40%                            2.2x
>>>>   20%                            1.5x
>>>>   10%                            1.0x
>>>>    5%                            0.67x
>>>>    0%                            0.30%
>>>>
>>>> In our scenario, we have two column families. The key value from the
>>>> essential column family is used in the filter, while the key value from
>>>> the
>>>> other, non essential column family is returned by the scan. Each row
>>>> contains values for both key values, with the values being relatively
>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>> seeing a
>>>> performance gain is when less than 10% of the rows are selected.
>>>>
>>>> Is this a reasonable test? Has anyone else measured this?
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
reference.

On my MacBook, I got the following results from the test:

2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.973822 seconds, got 100 rows
...
2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 0.47235 seconds, got 100 rows

Cheers

On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <yu...@gmail.com> wrote:

> Looking at
> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt, I found that it didn't contain TestJoinedScanners which shows
> difference in scanner performance:
>
>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> Double.toString(timeSec)
>
>       + " seconds, got " + Long.toString(rows_count/2) + " rows");
>
> The test uses SingleColumnValueFilter:
>
>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>
>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> It is possible that the custom filter you were using would exhibit
> different access pattern compared to SingleColumnValueFilter. e.g. does
> your filter utilize hint ?
> It would be easier for me and other people to reproduce the issue you
> experienced if you put your scenario in some test similar to
> TestJoinedScanners.
>
> Will take a closer look at the code Monday.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jt...@salesforce.com>wrote:
>
>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>> filterIfMissing isn't the issue - the results of the scan are correct.
>>
>> I can see that if the essential column family has more data compared to
>> the non essential column family that the results would eventually even out.
>> I was hoping to always be able to enable the essential column family
>> feature. Is there an inherent reason why performance would degrade like
>> this? Does it boil down to a single sequential scan versus many seeks?
>>
>> Thanks,
>>
>> James
>>
>>
>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>
>>> James:
>>> Your test was based on 0.94.6.1, right ?
>>>
>>> What Filter were you using ?
>>>
>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>
>>> BTW the use case Max Lapan tried to address has non essential column
>>> family
>>> carrying considerably more data compared to essential column family.
>>>
>>> Cheers
>>>
>>>
>>>
>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com
>>> >wrote:
>>>
>>>  Hello,
>>>> We're doing some performance testing of the essential column family
>>>> feature, and we're seeing some performance degradation when comparing
>>>> with
>>>> and without the feature enabled:
>>>>
>>>>                            Performance of scan relative
>>>> % of rows selected        to not enabling the feature
>>>> ---------------------    --------------------------------
>>>>
>>>> 100%                            1.0x
>>>>   80%                            2.0x
>>>>   60%                            2.3x
>>>>   40%                            2.2x
>>>>   20%                            1.5x
>>>>   10%                            1.0x
>>>>    5%                            0.67x
>>>>    0%                            0.30%
>>>>
>>>> In our scenario, we have two column families. The key value from the
>>>> essential column family is used in the filter, while the key value from
>>>> the
>>>> other, non essential column family is returned by the scan. Each row
>>>> contains values for both key values, with the values being relatively
>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>> seeing a
>>>> performance gain is when less than 10% of the rows are selected.
>>>>
>>>> Is this a reasonable test? Has anyone else measured this?
>>>>
>>>> Thanks,
>>>>
>>>> James

Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
Looking at
https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
I found that it didn't contain TestJoinedScanners, which shows the
difference in scanner performance:

    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
      + Double.toString(timeSec)
      + " seconds, got " + Long.toString(rows_count/2) + " rows");

The test uses SingleColumnValueFilter:

    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);

It is possible that the custom filter you were using exhibits a
different access pattern than SingleColumnValueFilter. For example, does
your filter utilize a hint?

It would be easier for me and other people to reproduce the issue you
experienced if you put your scenario in a test similar to
TestJoinedScanners.

Will take a closer look at the code Monday.

Cheers

On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <jt...@salesforce.com> wrote:

> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
> filterIfMissing isn't the issue - the results of the scan are correct.
>
> I can see that if the essential column family has more data compared to
> the non essential column family that the results would eventually even out.
> I was hoping to always be able to enable the essential column family
> feature. Is there an inherent reason why performance would degrade like
> this? Does it boil down to a single sequential scan versus many seeks?
>
> Thanks,
>
> James
>
>
> On 04/07/2013 07:44 AM, Ted Yu wrote:
>
>> James:
>> Your test was based on 0.94.6.1, right?
>>
>> What Filter were you using?
>>
>> If you used SingleColumnValueFilter, have you seen my comment here?
>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>
>> BTW, the use case Max Lapan tried to address has the non essential column
>> family carrying considerably more data than the essential column family.
>>
>> Cheers
>>
>>
>>
>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jtaylor@salesforce.com> wrote:
>>
>>> Hello,
>>> We're doing some performance testing of the essential column family
>>> feature, and we're seeing some performance degradation when comparing
>>> with and without the feature enabled:
>>>
>>>                           Performance of scan relative
>>> % of rows selected        to not enabling the feature
>>> ---------------------    --------------------------------
>>> 100%                            1.0x
>>>  80%                            2.0x
>>>  60%                            2.3x
>>>  40%                            2.2x
>>>  20%                            1.5x
>>>  10%                            1.0x
>>>   5%                            0.67x
>>>   0%                            0.30x
>>>
>>> In our scenario, we have two column families. The key value from the
>>> essential column family is used in the filter, while the key value from
>>> the other, non essential column family is returned by the scan. Each row
>>> contains values for both key values, with the values being relatively
>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>
>>> Is this a reasonable test? Has anyone else measured this?
>>>
>>> Thanks,
>>>
>>> James

Re: Essential column family performance

Posted by James Taylor <jt...@salesforce.com>.
Yes, on 0.94.6. We have our own custom filter derived from FilterBase, 
so filterIfMissing isn't the issue - the results of the scan are correct.

I can see that if the essential column family has more data compared to 
the non essential column family that the results would eventually even 
out. I was hoping to always be able to enable the essential column 
family feature. Is there an inherent reason why performance would 
degrade like this? Does it boil down to a single sequential scan versus 
many seeks?
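
As a rough way to reason about the sequential-scan-versus-seeks question,
here is a toy cost model in Java. It is only a sketch under stated
assumptions, not a description of how the joined scanners are implemented:
the class name, the per-row cost parameters, and the 10:1 seek-to-read
ratio are all hypothetical and chosen purely for illustration. It assumes
the essential family is always read sequentially, and that each selected
row then costs one seek into the non essential family.

```java
public class JoinedScanCostModel {

    // Cost of reading both families sequentially for every row (feature disabled).
    public static double fullScanCost(long rows, double essentialRead,
                                      double nonEssentialRead) {
        return rows * (essentialRead + nonEssentialRead);
    }

    // Cost of reading the essential family for every row, plus one seek-and-read
    // into the non essential family for each selected row (feature enabled).
    public static double joinedScanCost(long rows, double selectivity,
                                        double essentialRead, double seekAndRead) {
        return rows * essentialRead + rows * selectivity * seekAndRead;
    }

    // Fraction of selected rows at which both strategies cost the same.
    public static double breakEvenSelectivity(double nonEssentialRead,
                                              double seekAndRead) {
        return nonEssentialRead / seekAndRead;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        // Hypothetical costs in arbitrary units; these are not measured values.
        double essentialRead = 1.0;
        double nonEssentialRead = 1.0;
        double seekAndRead = 10.0;

        for (double selectivity : new double[] {0.0, 0.05, 0.10, 0.20, 0.40}) {
            double relative =
                joinedScanCost(rows, selectivity, essentialRead, seekAndRead)
                    / fullScanCost(rows, essentialRead, nonEssentialRead);
            System.out.printf("selected=%.0f%% relative cost=%.2fx%n",
                selectivity * 100, relative);
        }
        System.out.printf("break-even at %.0f%% selected%n",
            100 * breakEvenSelectivity(nonEssentialRead, seekAndRead));
    }
}
```

In this model the break-even selectivity is simply
nonEssentialRead / seekAndRead, so the feature only pays off when the
fraction of selected rows is smaller than that ratio. The model is linear
in selectivity, so it cannot explain the measured return to 1.0x at 100%
selected; a real test along the lines of TestJoinedScanners would still be
needed to settle that.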

Thanks,

James

On 04/07/2013 07:44 AM, Ted Yu wrote:
> James:
> Your test was based on 0.94.6.1, right?
>
> What Filter were you using?
>
> If you used SingleColumnValueFilter, have you seen my comment here?
> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>
> BTW, the use case Max Lapan tried to address has the non essential column
> family carrying considerably more data than the essential column family.
>
> Cheers
>
>
>
> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jt...@salesforce.com> wrote:
>
>> Hello,
>> We're doing some performance testing of the essential column family
>> feature, and we're seeing some performance degradation when comparing with
>> and without the feature enabled:
>>
>>                           Performance of scan relative
>> % of rows selected        to not enabling the feature
>> ---------------------    --------------------------------
>> 100%                            1.0x
>>  80%                            2.0x
>>  60%                            2.3x
>>  40%                            2.2x
>>  20%                            1.5x
>>  10%                            1.0x
>>   5%                            0.67x
>>   0%                            0.30x
>>
>> In our scenario, we have two column families. The key value from the
>> essential column family is used in the filter, while the key value from the
>> other, non essential column family is returned by the scan. Each row
>> contains values for both key values, with the values being relatively
>> narrow (less than 50 bytes). In this scenario, the only time we're seeing a
>> performance gain is when less than 10% of the rows are selected.
>>
>> Is this a reasonable test? Has anyone else measured this?
>>
>> Thanks,
>>
>> James


Re: Essential column family performance

Posted by Ted Yu <yu...@gmail.com>.
James:
Your test was based on 0.94.6.1, right?

What Filter were you using?

If you used SingleColumnValueFilter, have you seen my comment here?
https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229

BTW, the use case Max Lapan tried to address has the non essential column
family carrying considerably more data than the essential column family.

Cheers



On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <jt...@salesforce.com> wrote:

> Hello,
> We're doing some performance testing of the essential column family
> feature, and we're seeing some performance degradation when comparing with
> and without the feature enabled:
>
>                           Performance of scan relative
> % of rows selected        to not enabling the feature
> ---------------------    --------------------------------
> 100%                            1.0x
>  80%                            2.0x
>  60%                            2.3x
>  40%                            2.2x
>  20%                            1.5x
>  10%                            1.0x
>   5%                            0.67x
>   0%                            0.30x
>
> In our scenario, we have two column families. The key value from the
> essential column family is used in the filter, while the key value from the
> other, non essential column family is returned by the scan. Each row
> contains values for both key values, with the values being relatively
> narrow (less than 50 bytes). In this scenario, the only time we're seeing a
> performance gain is when less than 10% of the rows are selected.
>
> Is this a reasonable test? Has anyone else measured this?
>
> Thanks,
>
> James