Posted to user@hbase.apache.org by Jerry Lam <ch...@gmail.com> on 2012/08/27 23:40:50 UTC

setTimeRange and setMaxVersions seem to be inefficient

Hi HBase community:

I tried to use setTimeRange and setMaxVersions to limit the number of KVs
returned per column. The behaviour is as I would expect: setTimeRange(0, T + 1)
and setMaxVersions(1) give me ONE version of a KV with a timestamp that is
less than or equal to T.
However, I noticed that all versions of the KeyValue for a particular
column are processed through a custom filter I implemented, even though I
specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that once ONE
KV of a particular column gets ReturnCode.INCLUDE, the framework would jump
to the next COL instead of iterating through all versions of the column.
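
For reference, this is roughly the scan setup I am describing (a minimal
sketch; the table, the custom filter and T are placeholders from my setup):

Scan scan = new Scan();
// Only consider timestamps <= T (the upper bound of the time range is exclusive).
scan.setTimeRange(0, T + 1);
// Expect at most ONE version per column in the result.
scan.setMaxVersions(1);
scan.setFilter(myCustomFilter);
ResultScanner scanner = table.getScanner(scan);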

Can someone confirm whether this is the expected behaviour (iterating through
all versions of a column before setMaxVersions takes effect)? If this is the
expected behaviour, what is your recommendation to speed this up?

Best Regards,

Jerry

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by Jerry Lam <ch...@gmail.com>.
Hi Ted:

Sure, will do.
I will also implement the reset method to set previousIncludedQualifier to
null for the next row.
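
Roughly like this (a minimal sketch; previousIncludedQualifier is the same
field used in the filterKeyValue snippet quoted below):

@Override
public void reset() {
  // Clear the per-row state so the dedup logic starts fresh on the next row.
  previousIncludedQualifier = null;
}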

Best Regards,

Jerry

On Wed, Aug 29, 2012 at 1:47 PM, Ted Yu <yu...@gmail.com> wrote:

> Jerry:
> Remember to also implement:
>
> +  @Override
> +  public KeyValue getNextKeyHint(KeyValue currentKV) {
>
> You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.
>
> Cheers
>
> On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam <ch...@gmail.com> wrote:
>
> > Hi Lars:
> >
> > Thanks for spending time discussing this with me. I appreciate it.
> >
> > I tried to implement the setMaxVersions(1) inside the filter as follows:
> >
> > @Override
> > public ReturnCode filterKeyValue(KeyValue kv) {
> >
> > // check if the same qualifier as the one that has been included
> > previously. If yes, jump to next column
> > if (previousIncludedQualifier != null &&
> > Bytes.compareTo(previousIncludedQualifier,kv.getQualifier()) == 0) {
> > previousIncludedQualifier = null;
> > return ReturnCode.NEXT_COL;
> > }
> >         // another condition that makes the jump further using HINT
> > if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
> > LOG.info("Matched Found.");
> > return ReturnCode.SEEK_NEXT_USING_HINT;
> >
> > }
> >         // include this to the result and keep track of the included
> > qualifier so the next version of the same qualifier will be excluded
> > previousIncludedQualifier = kv.getQualifier();
> > return ReturnCode.INCLUDE;
> > }
> >
> > Does this look reasonable or there is a better way to achieve this? It
> > would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case
> though.
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> > On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> >
> > > Hi Jerry,
> > >
> > > my answer will be the same again:
> > > Some folks will want the max versions set by the client to be before
> > > filters and some folks will want it to restrict the end result.
> > > It's not possible to have it both ways. Your filter needs to do the
> right
> > > thing.
> > >
> > >
> > > There's a lot of discussion around this in HBASE-5104.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: Jerry Lam <ch...@gmail.com>
> > > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > > Sent: Tuesday, August 28, 2012 1:52 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > I see. Please refer to the inline comment below.
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lh...@yahoo.com>
> > > wrote:
> > >
> > > > What I was saying was: It depends. :)
> > > >
> > > > First off, how do you get to 1000 versions? In 0.94++ older version
> are
> > > > pruned upon flush, so you need 333 flushes (assuming 3 versions on
> the
> > > CF)
> > > > to get 1000 versions.
> > > >
> > >
> > > I forgot that the default number of version to keep is 3. If this is
> what
> > > people use most of the time, yes you are right for this type of
> scenarios
> > > where the number of version per column to keep is small.
> > >
> > > By that time some compactions will have happened and you're back to
> close
> > > > to 3 versions (maybe 9, 12, or 15 or so, depending on how store files
> > you
> > > > have).
> > > >
> > > > Now, if you have that many version because because you set
> > VERSIONS=>1000
> > > > in your CF... Then imagine you have 100 columns with 1000 versions
> > each.
> > > >
> > >
> > > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > > versioning myself)
> > >
> > > In your scenario below you'd do 100000 comparisons if the filter would
> be
> > > > evaluated after the version counting. But only 1100 with the current
> > > code.
> > > > (or at least in that ball park)
> > > >
> > >
> > > This is where I don't quite understand what you mean.
> > >
> > > if the framework counts the number of ReturnCode.INCLUDE and then stops
> > > feeding the KeyValue into the filterKeyValue method after it reaches
> the
> > > count specified in setMaxVersions (i.e. 1 for the case we discussed),
> > > should then be just 100 comparisons only (at most) instead of 1100
> > > comparisons? Maybe I don't understand how the current way is doing...
> > >
> > >
> > >
> > > >
> > > > The gist is: One can construct scenarios where one approach is better
> > > than
> > > > the other. Only one order is possible.
> > > > If you write a custom filter and you care about these things you
> should
> > > > use the seek hints.
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <ch...@gmail.com>
> > > > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > > > Cc:
> > > > Sent: Tuesday, August 28, 2012 7:17 AM
> > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi Lars:
> > > >
> > > > Thanks for the reply.
> > > > I need to understand if I misunderstood the perceived inefficiency
> > > because
> > > > it seems you don't think quite the same.
> > > >
> > > > Let say, as an example, we have 1 row with 2 columns (col-1 and
> col-2)
> > > in a
> > > > table and each column has 1000 versions. Using the following code
> (the
> > > code
> > > > might have errors and don't compile):
> > > > /**
> > > > * This is very simple use case of a ColumnPrefixFilter.
> > > > * In fact all other filters that make use of filterKeyValue will see
> > > > similar
> > > > * performance problems that I have concerned with when the number of
> > > > * versions per column could be huge.
> > > >
> > > > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > > > Scan scan = new Scan();
> > > > scan.setFilter(filter);
> > > > ResultScanner scanner = table.getScanner(scan);
> > > > for (Result result : scanner) {
> > > >     for (KeyValue kv : result.raw()) {
> > > >         System.out.println("KV: " + kv + ", Value: " +
> > > >         Bytes.toString(kv.getValue()));
> > > >     }
> > > > }
> > > > scanner.close();
> > > > */
> > > >
> > > > Implicitly, the number of version per column that is going to return
> > is 1
> > > > (the latest version). User might expect that only 2 comparisons for
> > > column
> > > > prefix are needed (1 for col-1 and 1 for col-2) but in fact, it
> > processes
> > > > the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for
> col-1
> > > and
> > > > 1000 for col-2) for col-2 (1 per version) because all versions of the
> > > > column have the same prefix for obvious reason. For col-1, it will
> skip
> > > > using SEEK_NEXT_USING_HINT which should skip the 99 versions of
> col-1.
> > > >
> > > > In summary, the 1000 comparisons (5000 byte comparisons) for the
> column
> > > > prefix "col-2" is wasted because only 1 version is returned to user.
> > > Also,
> > > > I believe this inefficiency is hidden from the user code but it
> affects
> > > all
> > > > filters that use filterKeyValue as the main execution for filtering
> > KVs.
> > > Do
> > > > we have a case to improve HBase to handle this inefficiency? :) It
> > seems
> > > > valid unless you prove otherwise.
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
> > > >
> > > >
> > > >
> > > > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofhansl@yahoo.com
> >
> > > > wrote:
> > > >
> > > > > First off regarding "inefficiency"... If version counting would
> > happen
> > > > > first and then filter were executed we'd have folks "complaining"
> > about
> > > > > inefficiencies as well:
> > > > > ("Why does the code have to go through the versioning stuff when my
> > > > filter
> > > > > filters the row/column/version anyway?")  ;-)
> > > > >
> > > > >
> > > > > For your problem, you want to make use of "seek hints"...
> > > > >
> > > > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > > > >
> > > > > That way the scanning framework will know to skip ahead to the next
> > > > > column, row, or a KV of your choosing. (see Filter.filterKeyValue
> and
> > > > > Filter.getNextKeyHint).
> > > > >
> > > > > (as an aside, it would probably be nice if Filters also had
> > > > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > > > StoreScanner)
> > > > >
> > > > > Have a look at ColumnPrefixFilter as an example.
> > > > > I also wrote a short post here:
> > > > >
> > > >
> > >
> >
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > > > >
> > > > > Does that help?
> > > > >
> > > > > -- Lars
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Jerry Lam <ch...@gmail.com>
> > > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > > Sent: Monday, August 27, 2012 5:59 PM
> > > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > > >
> > > > > Hi Lars:
> > > > >
> > > > > Thanks for confirming the inefficiency of the implementation for
> this
> > > > > case. For my case, a column can have more than 10K versions, I
> need a
> > > > quick
> > > > > way to stop the scan from digging the column once there is a match
> > > > > (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that
> can
> > > > notify
> > > > > the framework to stop and go to next column once the number of
> > versions
> > > > > specify in setMaxVersions is met.
> > > > >
> > > > > For now, I guess I have to hack it in the custom filter (I.e. I
> keep
> > > the
> > > > > count myself)? If you have a better way to achieve this, please
> share
> > > :)
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Jerry
> > > > >
> > > > > Sent from my iPad (sorry for spelling mistakes)
> > > > >
> > > > > On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com>
> wrote:
> > > > >
> > > > > > Currently filters are evaluated before we do version counting.
> > > > > >
> > > > > > Here's a comment from ScanQueryMatcher.java:
> > > > > >     /**
> > > > > >      * Filters should be checked before checking column trackers.
> > If
> > > we
> > > > > do
> > > > > >      * otherwise, as was previously being done, ColumnTracker may
> > > > > increment its
> > > > > >      * counter for even that KV which may be discarded later on
> by
> > > > > Filter. This
> > > > > >      * would lead to incorrect results in certain cases.
> > > > > >      */
> > > > > >
> > > > > >
> > > > > > So this is by design. (Doesn't mean it's correct or desirable,
> > > though.)
> > > > > >
> > > > > > -- Lars
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Jerry Lam <ch...@gmail.com>
> > > > > > To: user <us...@hbase.apache.org>
> > > > > > Cc:
> > > > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > > > >
> > > > > > Hi HBase community:
> > > > > >
> > > > > > I tried to use setTimeRange and setMaxVersions to limit the
> number
> > of
> > > > KVs
> > > > > > return per column. The behaviour is as I would expect that is
> > > > > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE
> > version
> > > > of
> > > > > KV
> > > > > > with timestamp that is less than or equal to T.
> > > > > > However, I noticed that all versions of the KeyValue for a
> > particular
> > > > > > column are processed through a custom filter I implemented even
> > > though
> > > > I
> > > > > > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected
> that
> > > if
> > > > > ONE
> > > > > > KV of a particular column has ReturnCode.INCLUDE, the framework
> > will
> > > > jump
> > > > > > to the next COL instead of iterating through all versions of the
> > > > column.
> > > > > >
> > > > > > Can someone confirm me if this is the expected behaviour
> (iterating
> > > > > through
> > > > > > all versions of a column before setMaxVersions take effect)? If
> > this
> > > is
> > > > > an
> > > > > > expected behaviour, what is your recommendation to speed this up?
> > > > > >
> > > > > > Best Regards,
> > > > > >
> > > > > > Jerry
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
>

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by Ted Yu <yu...@gmail.com>.
Jerry:
Remember to also implement:

+  @Override
+  public KeyValue getNextKeyHint(KeyValue currentKV) {
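
Roughly along these lines (a sketch only; nextColumnHint is a placeholder for
whatever qualifier your filter decides to seek to when it returns
SEEK_NEXT_USING_HINT):

@Override
public KeyValue getNextKeyHint(KeyValue currentKV) {
  // Seek to the first possible KeyValue of the target column in the current row.
  return KeyValue.createFirstOnRow(currentKV.getRow(), currentKV.getFamily(),
      nextColumnHint);
}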

You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.

Cheers

On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Lars:
>
> Thanks for spending time discussing this with me. I appreciate it.
>
> I tried to implement the setMaxVersions(1) inside the filter as follows:
>
> @Override
> public ReturnCode filterKeyValue(KeyValue kv) {
>
> // check if the same qualifier as the one that has been included
> previously. If yes, jump to next column
> if (previousIncludedQualifier != null &&
> Bytes.compareTo(previousIncludedQualifier,kv.getQualifier()) == 0) {
> previousIncludedQualifier = null;
> return ReturnCode.NEXT_COL;
> }
>         // another condition that makes the jump further using HINT
> if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
> LOG.info("Matched Found.");
> return ReturnCode.SEEK_NEXT_USING_HINT;
>
> }
>         // include this to the result and keep track of the included
> qualifier so the next version of the same qualifier will be excluded
> previousIncludedQualifier = kv.getQualifier();
> return ReturnCode.INCLUDE;
> }
>
> Does this look reasonable or there is a better way to achieve this? It
> would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case though.
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > Hi Jerry,
> >
> > my answer will be the same again:
> > Some folks will want the max versions set by the client to be before
> > filters and some folks will want it to restrict the end result.
> > It's not possible to have it both ways. Your filter needs to do the right
> > thing.
> >
> >
> > There's a lot of discussion around this in HBASE-5104.
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Jerry Lam <ch...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > Sent: Tuesday, August 28, 2012 1:52 PM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > I see. Please refer to the inline comment below.
> >
> > Best Regards,
> >
> > Jerry
> >
> > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> >
> > > What I was saying was: It depends. :)
> > >
> > > First off, how do you get to 1000 versions? In 0.94++ older version are
> > > pruned upon flush, so you need 333 flushes (assuming 3 versions on the
> > CF)
> > > to get 1000 versions.
> > >
> >
> > I forgot that the default number of version to keep is 3. If this is what
> > people use most of the time, yes you are right for this type of scenarios
> > where the number of version per column to keep is small.
> >
> > By that time some compactions will have happened and you're back to close
> > > to 3 versions (maybe 9, 12, or 15 or so, depending on how store files
> you
> > > have).
> > >
> > > Now, if you have that many version because because you set
> VERSIONS=>1000
> > > in your CF... Then imagine you have 100 columns with 1000 versions
> each.
> > >
> >
> > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > versioning myself)
> >
> > In your scenario below you'd do 100000 comparisons if the filter would be
> > > evaluated after the version counting. But only 1100 with the current
> > code.
> > > (or at least in that ball park)
> > >
> >
> > This is where I don't quite understand what you mean.
> >
> > if the framework counts the number of ReturnCode.INCLUDE and then stops
> > feeding the KeyValue into the filterKeyValue method after it reaches the
> > count specified in setMaxVersions (i.e. 1 for the case we discussed),
> > should then be just 100 comparisons only (at most) instead of 1100
> > comparisons? Maybe I don't understand how the current way is doing...
> >
> >
> >
> > >
> > > The gist is: One can construct scenarios where one approach is better
> > than
> > > the other. Only one order is possible.
> > > If you write a custom filter and you care about these things you should
> > > use the seek hints.
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <ch...@gmail.com>
> > > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > > Cc:
> > > Sent: Tuesday, August 28, 2012 7:17 AM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > Thanks for the reply.
> > > I need to understand if I misunderstood the perceived inefficiency
> > because
> > > it seems you don't think quite the same.
> > >
> > > Let say, as an example, we have 1 row with 2 columns (col-1 and col-2)
> > in a
> > > table and each column has 1000 versions. Using the following code (the
> > code
> > > might have errors and don't compile):
> > > /**
> > > * This is very simple use case of a ColumnPrefixFilter.
> > > * In fact all other filters that make use of filterKeyValue will see
> > > similar
> > > * performance problems that I have concerned with when the number of
> > > * versions per column could be huge.
> > >
> > > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > > Scan scan = new Scan();
> > > scan.setFilter(filter);
> > > ResultScanner scanner = table.getScanner(scan);
> > > for (Result result : scanner) {
> > >     for (KeyValue kv : result.raw()) {
> > >         System.out.println("KV: " + kv + ", Value: " +
> > >         Bytes.toString(kv.getValue()));
> > >     }
> > > }
> > > scanner.close();
> > > */
> > >
> > > Implicitly, the number of version per column that is going to return
> is 1
> > > (the latest version). User might expect that only 2 comparisons for
> > column
> > > prefix are needed (1 for col-1 and 1 for col-2) but in fact, it
> processes
> > > the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for col-1
> > and
> > > 1000 for col-2) for col-2 (1 per version) because all versions of the
> > > column have the same prefix for obvious reason. For col-1, it will skip
> > > using SEEK_NEXT_USING_HINT which should skip the 99 versions of col-1.
> > >
> > > In summary, the 1000 comparisons (5000 byte comparisons) for the column
> > > prefix "col-2" is wasted because only 1 version is returned to user.
> > Also,
> > > I believe this inefficiency is hidden from the user code but it affects
> > all
> > > filters that use filterKeyValue as the main execution for filtering
> KVs.
> > Do
> > > we have a case to improve HBase to handle this inefficiency? :) It
> seems
> > > valid unless you prove otherwise.
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > >
> > >
> > > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lh...@yahoo.com>
> > > wrote:
> > >
> > > > First off regarding "inefficiency"... If version counting would
> happen
> > > > first and then filter were executed we'd have folks "complaining"
> about
> > > > inefficiencies as well:
> > > > ("Why does the code have to go through the versioning stuff when my
> > > filter
> > > > filters the row/column/version anyway?")  ;-)
> > > >
> > > >
> > > > For your problem, you want to make use of "seek hints"...
> > > >
> > > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > > >
> > > > That way the scanning framework will know to skip ahead to the next
> > > > column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> > > > Filter.getNextKeyHint).
> > > >
> > > > (as an aside, it would probably be nice if Filters also had
> > > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > > StoreScanner)
> > > >
> > > > Have a look at ColumnPrefixFilter as an example.
> > > > I also wrote a short post here:
> > > >
> > >
> >
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > > >
> > > > Does that help?
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <ch...@gmail.com>
> > > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > > Sent: Monday, August 27, 2012 5:59 PM
> > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi Lars:
> > > >
> > > > Thanks for confirming the inefficiency of the implementation for this
> > > > case. For my case, a column can have more than 10K versions, I need a
> > > quick
> > > > way to stop the scan from digging the column once there is a match
> > > > (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can
> > > notify
> > > > the framework to stop and go to next column once the number of
> versions
> > > > specify in setMaxVersions is met.
> > > >
> > > > For now, I guess I have to hack it in the custom filter (I.e. I keep
> > the
> > > > count myself)? If you have a better way to achieve this, please share
> > :)
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
> > > >
> > > > Sent from my iPad (sorry for spelling mistakes)
> > > >
> > > > On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:
> > > >
> > > > > Currently filters are evaluated before we do version counting.
> > > > >
> > > > > Here's a comment from ScanQueryMatcher.java:
> > > > >     /**
> > > > >      * Filters should be checked before checking column trackers.
> If
> > we
> > > > do
> > > > >      * otherwise, as was previously being done, ColumnTracker may
> > > > increment its
> > > > >      * counter for even that KV which may be discarded later on by
> > > > Filter. This
> > > > >      * would lead to incorrect results in certain cases.
> > > > >      */
> > > > >
> > > > >
> > > > > So this is by design. (Doesn't mean it's correct or desirable,
> > though.)
> > > > >
> > > > > -- Lars
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Jerry Lam <ch...@gmail.com>
> > > > > To: user <us...@hbase.apache.org>
> > > > > Cc:
> > > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > > >
> > > > > Hi HBase community:
> > > > >
> > > > > I tried to use setTimeRange and setMaxVersions to limit the number
> of
> > > KVs
> > > > > return per column. The behaviour is as I would expect that is
> > > > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE
> version
> > > of
> > > > KV
> > > > > with timestamp that is less than or equal to T.
> > > > > However, I noticed that all versions of the KeyValue for a
> particular
> > > > > column are processed through a custom filter I implemented even
> > though
> > > I
> > > > > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that
> > if
> > > > ONE
> > > > > KV of a particular column has ReturnCode.INCLUDE, the framework
> will
> > > jump
> > > > > to the next COL instead of iterating through all versions of the
> > > column.
> > > > >
> > > > > Can someone confirm me if this is the expected behaviour (iterating
> > > > through
> > > > > all versions of a column before setMaxVersions take effect)? If
> this
> > is
> > > > an
> > > > > expected behaviour, what is your recommendation to speed this up?
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Jerry
> > > > >
> > > >
> > > >
> > >
> > >
> >
>

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by Jerry Lam <ch...@gmail.com>.
Hi Lars:

Thanks for spending time discussing this with me. I appreciate it.

I tried to implement the setMaxVersions(1) behaviour inside the filter as follows:

@Override
public ReturnCode filterKeyValue(KeyValue kv) {
    // Check whether this KV has the same qualifier as the one that was
    // included previously. If yes, jump to the next column.
    if (previousIncludedQualifier != null &&
        Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
        previousIncludedQualifier = null;
        return ReturnCode.NEXT_COL;
    }
    // Another condition that makes a further jump using a seek hint.
    if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
        LOG.info("Match found.");
        return ReturnCode.SEEK_NEXT_USING_HINT;
    }
    // Include this KV in the result and keep track of the included qualifier
    // so the next (older) version of the same qualifier will be excluded.
    previousIncludedQualifier = kv.getQualifier();
    return ReturnCode.INCLUDE;
}
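
For completeness, the state the snippet relies on looks roughly like this (the
class name is just a placeholder for my custom filter):

private static final Log LOG = LogFactory.getLog(MyColumnFilter.class);
// Target qualifier for the SEEK_NEXT_USING_HINT case, set in the constructor.
private byte[] qualifier;
// Qualifier of the last KV that was INCLUDEd, used to skip its older versions.
private byte[] previousIncludedQualifier;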

Does this look reasonable, or is there a better way to achieve this? It
would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case, though.

Best Regards,

Jerry


On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lh...@yahoo.com> wrote:

> Hi Jerry,
>
> my answer will be the same again:
> Some folks will want the max versions set by the client to be before
> filters and some folks will want it to restrict the end result.
> It's not possible to have it both ways. Your filter needs to do the right
> thing.
>
>
> There's a lot of discussion around this in HBASE-5104.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Jerry Lam <ch...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Sent: Tuesday, August 28, 2012 1:52 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> I see. Please refer to the inline comment below.
>
> Best Regards,
>
> Jerry
>
> On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > What I was saying was: It depends. :)
> >
> > First off, how do you get to 1000 versions? In 0.94++ older version are
> > pruned upon flush, so you need 333 flushes (assuming 3 versions on the
> CF)
> > to get 1000 versions.
> >
>
> I forgot that the default number of version to keep is 3. If this is what
> people use most of the time, yes you are right for this type of scenarios
> where the number of version per column to keep is small.
>
> By that time some compactions will have happened and you're back to close
> > to 3 versions (maybe 9, 12, or 15 or so, depending on how store files you
> > have).
> >
> > Now, if you have that many version because because you set VERSIONS=>1000
> > in your CF... Then imagine you have 100 columns with 1000 versions each.
> >
>
> Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> versioning myself)
>
> In your scenario below you'd do 100000 comparisons if the filter would be
> > evaluated after the version counting. But only 1100 with the current
> code.
> > (or at least in that ball park)
> >
>
> This is where I don't quite understand what you mean.
>
> if the framework counts the number of ReturnCode.INCLUDE and then stops
> feeding the KeyValue into the filterKeyValue method after it reaches the
> count specified in setMaxVersions (i.e. 1 for the case we discussed),
> should then be just 100 comparisons only (at most) instead of 1100
> comparisons? Maybe I don't understand how the current way is doing...
>
>
>
> >
> > The gist is: One can construct scenarios where one approach is better
> than
> > the other. Only one order is possible.
> > If you write a custom filter and you care about these things you should
> > use the seek hints.
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <ch...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > Cc:
> > Sent: Tuesday, August 28, 2012 7:17 AM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for the reply.
> > I need to understand if I misunderstood the perceived inefficiency
> because
> > it seems you don't think quite the same.
> >
> > Let say, as an example, we have 1 row with 2 columns (col-1 and col-2)
> in a
> > table and each column has 1000 versions. Using the following code (the
> code
> > might have errors and don't compile):
> > /**
> > * This is very simple use case of a ColumnPrefixFilter.
> > * In fact all other filters that make use of filterKeyValue will see
> > similar
> > * performance problems that I have concerned with when the number of
> > * versions per column could be huge.
> >
> > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > Scan scan = new Scan();
> > scan.setFilter(filter);
> > ResultScanner scanner = table.getScanner(scan);
> > for (Result result : scanner) {
> >     for (KeyValue kv : result.raw()) {
> >         System.out.println("KV: " + kv + ", Value: " +
> >         Bytes.toString(kv.getValue()));
> >     }
> > }
> > scanner.close();
> > */
> >
> > Implicitly, the number of version per column that is going to return is 1
> > (the latest version). User might expect that only 2 comparisons for
> column
> > prefix are needed (1 for col-1 and 1 for col-2) but in fact, it processes
> > the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for col-1
> and
> > 1000 for col-2) for col-2 (1 per version) because all versions of the
> > column have the same prefix for obvious reason. For col-1, it will skip
> > using SEEK_NEXT_USING_HINT which should skip the 99 versions of col-1.
> >
> > In summary, the 1000 comparisons (5000 byte comparisons) for the column
> > prefix "col-2" is wasted because only 1 version is returned to user.
> Also,
> > I believe this inefficiency is hidden from the user code but it affects
> all
> > filters that use filterKeyValue as the main execution for filtering KVs.
> Do
> > we have a case to improve HBase to handle this inefficiency? :) It seems
> > valid unless you prove otherwise.
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> >
> > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> >
> > > First off regarding "inefficiency"... If version counting would happen
> > > first and then filter were executed we'd have folks "complaining" about
> > > inefficiencies as well:
> > > ("Why does the code have to go through the versioning stuff when my
> > filter
> > > filters the row/column/version anyway?")  ;-)
> > >
> > >
> > > For your problem, you want to make use of "seek hints"...
> > >
> > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > >
> > > That way the scanning framework will know to skip ahead to the next
> > > column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> > > Filter.getNextKeyHint).
> > >
> > > (as an aside, it would probably be nice if Filters also had
> > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > StoreScanner)
> > >
> > > Have a look at ColumnPrefixFilter as an example.
> > > I also wrote a short post here:
> > >
> >
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > >
> > > Does that help?
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <ch...@gmail.com>
> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > Sent: Monday, August 27, 2012 5:59 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > Thanks for confirming the inefficiency of the implementation for this
> > > case. For my case, a column can have more than 10K versions, I need a
> > quick
> > > way to stop the scan from digging the column once there is a match
> > > (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can
> > notify
> > > the framework to stop and go to next column once the number of versions
> > > specify in setMaxVersions is met.
> > >
> > > For now, I guess I have to hack it in the custom filter (I.e. I keep
> the
> > > count myself)? If you have a better way to achieve this, please share
> :)
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > Sent from my iPad (sorry for spelling mistakes)
> > >
> > > On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:
> > >
> > > > Currently filters are evaluated before we do version counting.
> > > >
> > > > Here's a comment from ScanQueryMatcher.java:
> > > >     /**
> > > >      * Filters should be checked before checking column trackers. If
> we
> > > do
> > > >      * otherwise, as was previously being done, ColumnTracker may
> > > increment its
> > > >      * counter for even that KV which may be discarded later on by
> > > Filter. This
> > > >      * would lead to incorrect results in certain cases.
> > > >      */
> > > >
> > > >
> > > > So this is by design. (Doesn't mean it's correct or desirable,
> though.)
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <ch...@gmail.com>
> > > > To: user <us...@hbase.apache.org>
> > > > Cc:
> > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi HBase community:
> > > >
> > > > I tried to use setTimeRange and setMaxVersions to limit the number of
> > KVs
> > > > return per column. The behaviour is as I would expect that is
> > > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version
> > of
> > > KV
> > > > with timestamp that is less than or equal to T.
> > > > However, I noticed that all versions of the KeyValue for a particular
> > > > column are processed through a custom filter I implemented even
> though
> > I
> > > > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that
> if
> > > ONE
> > > > KV of a particular column has ReturnCode.INCLUDE, the framework will
> > jump
> > > > to the next COL instead of iterating through all versions of the
> > column.
> > > >
> > > > Can someone confirm me if this is the expected behaviour (iterating
> > > through
> > > > all versions of a column before setMaxVersions take effect)? If this
> is
> > > an
> > > > expected behaviour, what is your recommendation to speed this up?
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
> > > >
> > >
> > >
> >
> >
>

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by lars hofhansl <lh...@yahoo.com>.
Hi Jerry,

my answer will be the same again:
Some folks will want the max versions set by the client to be applied before filters, and some folks will want it to restrict the end result.
It's not possible to have it both ways. Your filter needs to do the right thing.


There's a lot of discussion around this in HBASE-5104.


-- Lars



________________________________
 From: Jerry Lam <ch...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com> 
Sent: Tuesday, August 28, 2012 1:52 PM
Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
 
Hi Lars:

I see. Please refer to the inline comment below.

Best Regards,

Jerry

On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lh...@yahoo.com> wrote:

> What I was saying was: It depends. :)
>
> First off, how do you get to 1000 versions? In 0.94++ older version are
> pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF)
> to get 1000 versions.
>

I forgot that the default number of version to keep is 3. If this is what
people use most of the time, yes you are right for this type of scenarios
where the number of version per column to keep is small.

By that time some compactions will have happened and you're back to close
> to 3 versions (maybe 9, 12, or 15 or so, depending on how store files you
> have).
>
> Now, if you have that many version because because you set VERSIONS=>1000
> in your CF... Then imagine you have 100 columns with 1000 versions each.
>

Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
versioning myself)

In your scenario below you'd do 100000 comparisons if the filter would be
> evaluated after the version counting. But only 1100 with the current code.
> (or at least in that ball park)
>

This is where I don't quite understand what you mean.

if the framework counts the number of ReturnCode.INCLUDE and then stops
feeding the KeyValue into the filterKeyValue method after it reaches the
count specified in setMaxVersions (i.e. 1 for the case we discussed),
should then be just 100 comparisons only (at most) instead of 1100
comparisons? Maybe I don't understand how the current way is doing...



>
> The gist is: One can construct scenarios where one approach is better than
> the other. Only one order is possible.
> If you write a custom filter and you care about these things you should
> use the seek hints.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <ch...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Cc:
> Sent: Tuesday, August 28, 2012 7:17 AM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for the reply.
> I need to understand if I misunderstood the perceived inefficiency because
> it seems you don't think quite the same.
>
> Let say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a
> table and each column has 1000 versions. Using the following code (the code
> might have errors and don't compile):
> /**
> * This is very simple use case of a ColumnPrefixFilter.
> * In fact all other filters that make use of filterKeyValue will see
> similar
> * performance problems that I have concerned with when the number of
> * versions per column could be huge.
>
> Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> Scan scan = new Scan();
> scan.setFilter(filter);
> ResultScanner scanner = table.getScanner(scan);
> for (Result result : scanner) {
>     for (KeyValue kv : result.raw()) {
>         System.out.println("KV: " + kv + ", Value: " +
>         Bytes.toString(kv.getValue()));
>     }
> }
> scanner.close();
> */
>
> Implicitly, the number of version per column that is going to return is 1
> (the latest version). User might expect that only 2 comparisons for column
> prefix are needed (1 for col-1 and 1 for col-2) but in fact, it processes
> the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for col-1 and
> 1000 for col-2) for col-2 (1 per version) because all versions of the
> column have the same prefix for obvious reason. For col-1, it will skip
> using SEEK_NEXT_USING_HINT which should skip the 99 versions of col-1.
>
> In summary, the 1000 comparisons (5000 byte comparisons) for the column
> prefix "col-2" is wasted because only 1 version is returned to user. Also,
> I believe this inefficiency is hidden from the user code but it affects all
> filters that use filterKeyValue as the main execution for filtering KVs. Do
> we have a case to improve HBase to handle this inefficiency? :) It seems
> valid unless you prove otherwise.
>
> Best Regards,
>
> Jerry
>
>
>
> On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > First off regarding "inefficiency"... If version counting would happen
> > first and then filter were executed we'd have folks "complaining" about
> > inefficiencies as well:
> > ("Why does the code have to go through the versioning stuff when my
> filter
> > filters the row/column/version anyway?")  ;-)
> >
> >
> > For your problem, you want to make use of "seek hints"...
> >
> > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> >
> > That way the scanning framework will know to skip ahead to the next
> > column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> > Filter.getNextKeyHint).
> >
> > (as an aside, it would probably be nice if Filters also had
> > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> StoreScanner)
> >
> > Have a look at ColumnPrefixFilter as an example.
> > I also wrote a short post here:
> >
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> >
> > Does that help?
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <ch...@gmail.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Monday, August 27, 2012 5:59 PM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for confirming the inefficiency of the implementation for this
> > case. For my case, a column can have more than 10K versions, I need a
> quick
> > way to stop the scan from digging the column once there is a match
> > (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can
> notify
> > the framework to stop and go to next column once the number of versions
> > specify in setMaxVersions is met.
> >
> > For now, I guess I have to hack it in the custom filter (I.e. I keep the
> > count myself)? If you have a better way to achieve this, please share :)
> >
> > Best Regards,
> >
> > Jerry
> >
> > Sent from my iPad (sorry for spelling mistakes)
> >
> > On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:
> >
> > > Currently filters are evaluated before we do version counting.
> > >
> > > Here's a comment from ScanQueryMatcher.java:
> > >     /**
> > >      * Filters should be checked before checking column trackers. If we
> > do
> > >      * otherwise, as was previously being done, ColumnTracker may
> > increment its
> > >      * counter for even that KV which may be discarded later on by
> > Filter. This
> > >      * would lead to incorrect results in certain cases.
> > >      */
> > >
> > >
> > > So this is by design. (Doesn't mean it's correct or desirable, though.)
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <ch...@gmail.com>
> > > To: user <us...@hbase.apache.org>
> > > Cc:
> > > Sent: Monday, August 27, 2012 2:40 PM
> > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi HBase community:
> > >
> > > I tried to use setTimeRange and setMaxVersions to limit the number of
> KVs
> > > return per column. The behaviour is as I would expect that is
> > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version
> of
> > KV
> > > with timestamp that is less than or equal to T.
> > > However, I noticed that all versions of the KeyValue for a particular
> > > column are processed through a custom filter I implemented even though
> I
> > > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if
> > ONE
> > > KV of a particular column has ReturnCode.INCLUDE, the framework will
> jump
> > > to the next COL instead of iterating through all versions of the
> column.
> > >
> > > Can someone confirm me if this is the expected behaviour (iterating
> > through
> > > all versions of a column before setMaxVersions take effect)? If this is
> > an
> > > expected behaviour, what is your recommendation to speed this up?
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> >
> >
>
>

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by Jerry Lam <ch...@gmail.com>.
Hi Lars:

I see. Please refer to the inline comment below.

Best Regards,

Jerry

On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lh...@yahoo.com> wrote:

> What I was saying was: It depends. :)
>
> First off, how do you get to 1000 versions? In 0.94++ older version are
> pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF)
> to get 1000 versions.
>

I forgot that the default number of versions to keep is 3. If this is what
people use most of the time, then yes, you are right for this type of scenario,
where the number of versions per column to keep is small.

By that time some compactions will have happened and you're back to close
> to 3 versions (maybe 9, 12, or 15 or so, depending on how store files you
> have).
>
> Now, if you have that many version because because you set VERSIONS=>1000
> in your CF... Then imagine you have 100 columns with 1000 versions each.
>

Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
versioning myself)

In your scenario below you'd do 100000 comparisons if the filter would be
> evaluated after the version counting. But only 1100 with the current code.
> (or at least in that ball park)
>

This is where I don't quite understand what you mean.

If the framework counts the number of ReturnCode.INCLUDE results and stops
feeding KeyValues into the filterKeyValue method after it reaches the count
specified in setMaxVersions (i.e. 1 for the case we discussed), shouldn't
there be just 100 comparisons (at most) instead of 1100? Maybe I don't
understand how the current code works...



>
> The gist is: One can construct scenarios where one approach is better than
> the other. Only one order is possible.
> If you write a custom filter and you care about these things you should
> use the seek hints.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <ch...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Cc:
> Sent: Tuesday, August 28, 2012 7:17 AM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for the reply.
> I need to understand if I misunderstood the perceived inefficiency because
> it seems you don't think quite the same.
>
> Let say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a
> table and each column has 1000 versions. Using the following code (the code
> might have errors and don't compile):
> /**
> * This is very simple use case of a ColumnPrefixFilter.
> * In fact all other filters that make use of filterKeyValue will see
> similar
> * performance problems that I have concerned with when the number of
> * versions per column could be huge.
>
> Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> Scan scan = new Scan();
> scan.setFilter(filter);
> ResultScanner scanner = table.getScanner(scan);
> for (Result result : scanner) {
>     for (KeyValue kv : result.raw()) {
>         System.out.println("KV: " + kv + ", Value: " +
>         Bytes.toString(kv.getValue()));
>     }
> }
> scanner.close();
> */
>
> Implicitly, the number of version per column that is going to return is 1
> (the latest version). User might expect that only 2 comparisons for column
> prefix are needed (1 for col-1 and 1 for col-2) but in fact, it processes
> the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for col-1 and
> 1000 for col-2) for col-2 (1 per version) because all versions of the
> column have the same prefix for obvious reason. For col-1, it will skip
> using SEEK_NEXT_USING_HINT which should skip the 99 versions of col-1.
>
> In summary, the 1000 comparisons (5000 byte comparisons) for the column
> prefix "col-2" is wasted because only 1 version is returned to user. Also,
> I believe this inefficiency is hidden from the user code but it affects all
> filters that use filterKeyValue as the main execution for filtering KVs. Do
> we have a case to improve HBase to handle this inefficiency? :) It seems
> valid unless you prove otherwise.
>
> Best Regards,
>
> Jerry
>
>
>
> On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > First off regarding "inefficiency"... If version counting would happen
> > first and then filter were executed we'd have folks "complaining" about
> > inefficiencies as well:
> > ("Why does the code have to go through the versioning stuff when my
> filter
> > filters the row/column/version anyway?")  ;-)
> >
> >
> > For your problem, you want to make use of "seek hints"...
> >
> > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> >
> > That way the scanning framework will know to skip ahead to the next
> > column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> > Filter.getNextKeyHint).
> >
> > (as an aside, it would probably be nice if Filters also had
> > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> StoreScanner)
> >
> > Have a look at ColumnPrefixFilter as an example.
> > I also wrote a short post here:
> >
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> >
> > Does that help?
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <ch...@gmail.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Monday, August 27, 2012 5:59 PM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > Thanks for confirming the inefficiency of the implementation for this
> > case. For my case, a column can have more than 10K versions, I need a
> quick
> > way to stop the scan from digging the column once there is a match
> > (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can
> notify
> > the framework to stop and go to next column once the number of versions
> > specify in setMaxVersions is met.
> >
> > For now, I guess I have to hack it in the custom filter (I.e. I keep the
> > count myself)? If you have a better way to achieve this, please share :)
> >
> > Best Regards,
> >
> > Jerry
> >
> > Sent from my iPad (sorry for spelling mistakes)
> >
> > On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:
> >
> > > Currently filters are evaluated before we do version counting.
> > >
> > > Here's a comment from ScanQueryMatcher.java:
> > >     /**
> > >      * Filters should be checked before checking column trackers. If we
> > do
> > >      * otherwise, as was previously being done, ColumnTracker may
> > increment its
> > >      * counter for even that KV which may be discarded later on by
> > Filter. This
> > >      * would lead to incorrect results in certain cases.
> > >      */
> > >
> > >
> > > So this is by design. (Doesn't mean it's correct or desirable, though.)
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Jerry Lam <ch...@gmail.com>
> > > To: user <us...@hbase.apache.org>
> > > Cc:
> > > Sent: Monday, August 27, 2012 2:40 PM
> > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi HBase community:
> > >
> > > I tried to use setTimeRange and setMaxVersions to limit the number of
> KVs
> > > return per column. The behaviour is as I would expect that is
> > > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version
> of
> > KV
> > > with timestamp that is less than or equal to T.
> > > However, I noticed that all versions of the KeyValue for a particular
> > > column are processed through a custom filter I implemented even though
> I
> > > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if
> > ONE
> > > KV of a particular column has ReturnCode.INCLUDE, the framework will
> jump
> > > to the next COL instead of iterating through all versions of the
> column.
> > >
> > > Can someone confirm me if this is the expected behaviour (iterating
> > through
> > > all versions of a column before setMaxVersions take effect)? If this is
> > an
> > > expected behaviour, what is your recommendation to speed this up?
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> >
> >
>
>

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by lars hofhansl <lh...@yahoo.com>.
What I was saying was: It depends. :)

First off, how do you get to 1000 versions? In 0.94++ older versions are pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF) to get to 1000 versions.
By that time some compactions will have happened and you're back to close to 3 versions (maybe 9, 12, or 15 or so, depending on how many store files you have).

Now, if you have that many versions because you set VERSIONS=>1000 in your CF... Then imagine you have 100 columns with 1000 versions each.
In your scenario below you'd do 100000 comparisons if the filter were evaluated after the version counting, but only about 1100 with the current code
(or at least in that ballpark).


The gist is: One can construct scenarios where one approach is better than the other. Only one order is possible.
If you write a custom filter and you care about these things you should use the seek hints.

-- Lars


----- Original Message -----
From: Jerry Lam <ch...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Tuesday, August 28, 2012 7:17 AM
Subject: Re: setTimeRange and setMaxVersions seem to be inefficient

Hi Lars:

Thanks for the reply.
I need to understand if I misunderstood the perceived inefficiency because
it seems you don't think quite the same.

Let say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a
table and each column has 1000 versions. Using the following code (the code
might have errors and don't compile):
/**
* This is very simple use case of a ColumnPrefixFilter.
* In fact all other filters that make use of filterKeyValue will see
similar
* performance problems that I have concerned with when the number of
* versions per column could be huge.

Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
Scan scan = new Scan();
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    for (KeyValue kv : result.raw()) {
        System.out.println("KV: " + kv + ", Value: " +
        Bytes.toString(kv.getValue()));
    }
}
scanner.close();
*/

Implicitly, the number of version per column that is going to return is 1
(the latest version). User might expect that only 2 comparisons for column
prefix are needed (1 for col-1 and 1 for col-2) but in fact, it processes
the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for col-1 and
1000 for col-2) for col-2 (1 per version) because all versions of the
column have the same prefix for obvious reason. For col-1, it will skip
using SEEK_NEXT_USING_HINT which should skip the 99 versions of col-1.

In summary, the 1000 comparisons (5000 byte comparisons) for the column
prefix "col-2" is wasted because only 1 version is returned to user. Also,
I believe this inefficiency is hidden from the user code but it affects all
filters that use filterKeyValue as the main execution for filtering KVs. Do
we have a case to improve HBase to handle this inefficiency? :) It seems
valid unless you prove otherwise.

Best Regards,

Jerry



On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lh...@yahoo.com> wrote:

> First off regarding "inefficiency"... If version counting would happen
> first and then filter were executed we'd have folks "complaining" about
> inefficiencies as well:
> ("Why does the code have to go through the versioning stuff when my filter
> filters the row/column/version anyway?")  ;-)
>
>
> For your problem, you want to make use of "seek hints"...
>
> In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
>
> That way the scanning framework will know to skip ahead to the next
> column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> Filter.getNextKeyHint).
>
> (as an aside, it would probably be nice if Filters also had
> INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner)
>
> Have a look at ColumnPrefixFilter as an example.
> I also wrote a short post here:
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
>
> Does that help?
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <ch...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Monday, August 27, 2012 5:59 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for confirming the inefficiency of the implementation for this
> case. For my case, a column can have more than 10K versions, I need a quick
> way to stop the scan from digging the column once there is a match
> (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can notify
> the framework to stop and go to next column once the number of versions
> specify in setMaxVersions is met.
>
> For now, I guess I have to hack it in the custom filter (I.e. I keep the
> count myself)? If you have a better way to achieve this, please share :)
>
> Best Regards,
>
> Jerry
>
> Sent from my iPad (sorry for spelling mistakes)
>
> On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:
>
> > Currently filters are evaluated before we do version counting.
> >
> > Here's a comment from ScanQueryMatcher.java:
> >     /**
> >      * Filters should be checked before checking column trackers. If we
> do
> >      * otherwise, as was previously being done, ColumnTracker may
> increment its
> >      * counter for even that KV which may be discarded later on by
> Filter. This
> >      * would lead to incorrect results in certain cases.
> >      */
> >
> >
> > So this is by design. (Doesn't mean it's correct or desirable, though.)
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <ch...@gmail.com>
> > To: user <us...@hbase.apache.org>
> > Cc:
> > Sent: Monday, August 27, 2012 2:40 PM
> > Subject: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi HBase community:
> >
> > I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> > return per column. The behaviour is as I would expect that is
> > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of
> KV
> > with timestamp that is less than or equal to T.
> > However, I noticed that all versions of the KeyValue for a particular
> > column are processed through a custom filter I implemented even though I
> > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if
> ONE
> > KV of a particular column has ReturnCode.INCLUDE, the framework will jump
> > to the next COL instead of iterating through all versions of the column.
> >
> > Can someone confirm me if this is the expected behaviour (iterating
> through
> > all versions of a column before setMaxVersions take effect)? If this is
> an
> > expected behaviour, what is your recommendation to speed this up?
> >
> > Best Regards,
> >
> > Jerry
> >
>
>


Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by Jerry Lam <ch...@gmail.com>.
Hi Lars:

Thanks for the reply.
I want to check whether I have misunderstood the perceived inefficiency,
because it seems you don't see it quite the same way.

Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2) in
a table and each column has 1000 versions. Using the following code (the
code might have errors and may not compile):
/*
 * This is a very simple use case of a ColumnPrefixFilter.
 * In fact, all other filters that make use of filterKeyValue will see
 * similar performance problems to the one I am concerned with when the
 * number of versions per column is huge.
 */
Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
Scan scan = new Scan();
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    for (KeyValue kv : result.raw()) {
        System.out.println("KV: " + kv + ", Value: " +
            Bytes.toString(kv.getValue()));
    }
}
scanner.close();

Implicitly, the number of versions per column that will be returned is 1
(the latest version). A user might expect that only 2 column-prefix
comparisons are needed (1 for col-1 and 1 for col-2), but in fact the
filterKeyValue method in ColumnPrefixFilter is invoked 1001 times (1 for
col-1 and 1000 for col-2, i.e. once per version of col-2), because all
versions of a column have the same prefix for obvious reasons. For col-1,
the filter will skip ahead using SEEK_NEXT_USING_HINT, which should skip the
remaining 999 versions of col-1.

In summary, nearly all of the 1000 prefix comparisons (5000 byte
comparisons) for the column prefix "col-2" are wasted, because only 1
version is returned to the user. Also, I believe this inefficiency is hidden
from the user code, but it affects all filters that use filterKeyValue as
the main execution path for filtering KVs. Do we have a case for improving
HBase to handle this inefficiency? :) It seems valid unless you prove
otherwise.

Best Regards,

Jerry



On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lh...@yahoo.com> wrote:

> First off regarding "inefficiency"... If version counting would happen
> first and then filter were executed we'd have folks "complaining" about
> inefficiencies as well:
> ("Why does the code have to go through the versioning stuff when my filter
> filters the row/column/version anyway?")  ;-)
>
>
> For your problem, you want to make use of "seek hints"...
>
> In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
>
> That way the scanning framework will know to skip ahead to the next
> column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> Filter.getNextKeyHint).
>
> (as an aside, it would probably be nice if Filters also had
> INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner)
>
> Have a look at ColumnPrefixFilter as an example.
> I also wrote a short post here:
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
>
> Does that help?
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <ch...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
> Sent: Monday, August 27, 2012 5:59 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for confirming the inefficiency of the implementation for this
> case. For my case, a column can have more than 10K versions, I need a quick
> way to stop the scan from digging the column once there is a match
> (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can notify
> the framework to stop and go to next column once the number of versions
> specify in setMaxVersions is met.
>
> For now, I guess I have to hack it in the custom filter (I.e. I keep the
> count myself)? If you have a better way to achieve this, please share :)
>
> Best Regards,
>
> Jerry
>
> Sent from my iPad (sorry for spelling mistakes)
>
> On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:
>
> > Currently filters are evaluated before we do version counting.
> >
> > Here's a comment from ScanQueryMatcher.java:
> >     /**
> >      * Filters should be checked before checking column trackers. If we
> do
> >      * otherwise, as was previously being done, ColumnTracker may
> increment its
> >      * counter for even that KV which may be discarded later on by
> Filter. This
> >      * would lead to incorrect results in certain cases.
> >      */
> >
> >
> > So this is by design. (Doesn't mean it's correct or desirable, though.)
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <ch...@gmail.com>
> > To: user <us...@hbase.apache.org>
> > Cc:
> > Sent: Monday, August 27, 2012 2:40 PM
> > Subject: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi HBase community:
> >
> > I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> > return per column. The behaviour is as I would expect that is
> > setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of
> KV
> > with timestamp that is less than or equal to T.
> > However, I noticed that all versions of the KeyValue for a particular
> > column are processed through a custom filter I implemented even though I
> > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if
> ONE
> > KV of a particular column has ReturnCode.INCLUDE, the framework will jump
> > to the next COL instead of iterating through all versions of the column.
> >
> > Can someone confirm me if this is the expected behaviour (iterating
> through
> > all versions of a column before setMaxVersions take effect)? If this is
> an
> > expected behaviour, what is your recommendation to speed this up?
> >
> > Best Regards,
> >
> > Jerry
> >
>
>

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by lars hofhansl <lh...@yahoo.com>.
First off regarding "inefficiency"... If version counting would happen first and then filter were executed we'd have folks "complaining" about inefficiencies as well:
("Why does the code have to go through the versioning stuff when my filter filters the row/column/version anyway?")  ;-)


For your problem, you want to make use of "seek hints"...

In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).

That way the scanning framework will know to skip ahead to the next column, row, or a KV of your choosing. (see Filter.filterKeyValue and Filter.getNextKeyHint).

(as an aside, it would probably be nice if Filters also had INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner)

Have a look at ColumnPrefixFilter as an example.
I also wrote a short post here: http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html

Does that help?

-- Lars


----- Original Message -----
From: Jerry Lam <ch...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: "user@hbase.apache.org" <us...@hbase.apache.org>
Sent: Monday, August 27, 2012 5:59 PM
Subject: Re: setTimeRange and setMaxVersions seem to be inefficient

Hi Lars:

Thanks for confirming the inefficiency of the implementation for this case. In my case, a column can have more than 10K versions, so I need a quick way to stop the scan from digging into the column once there is a match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that notifies the framework to stop and go to the next column once the number of versions specified in setMaxVersions is met.

For now, I guess I have to hack it in the custom filter (i.e. keep the count myself)? If you have a better way to achieve this, please share :)

Best Regards,

Jerry

Sent from my iPad (sorry for spelling mistakes)

On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:

> Currently filters are evaluated before we do version counting.
> 
> Here's a comment from ScanQueryMatcher.java:
>     /**
>      * Filters should be checked before checking column trackers. If we do
>      * otherwise, as was previously being done, ColumnTracker may increment its
>      * counter for even that KV which may be discarded later on by Filter. This
>      * would lead to incorrect results in certain cases.
>      */
> 
> 
> So this is by design. (Doesn't mean it's correct or desirable, though.)
> 
> -- Lars
> 
> 
> ----- Original Message -----
> From: Jerry Lam <ch...@gmail.com>
> To: user <us...@hbase.apache.org>
> Cc: 
> Sent: Monday, August 27, 2012 2:40 PM
> Subject: setTimeRange and setMaxVersions seem to be inefficient
> 
> Hi HBase community:
> 
> I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> return per column. The behaviour is as I would expect that is
> setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of KV
> with timestamp that is less than or equal to T.
> However, I noticed that all versions of the KeyValue for a particular
> column are processed through a custom filter I implemented even though I
> specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if ONE
> KV of a particular column has ReturnCode.INCLUDE, the framework will jump
> to the next COL instead of iterating through all versions of the column.
> 
> Can someone confirm me if this is the expected behaviour (iterating through
> all versions of a column before setMaxVersions take effect)? If this is an
> expected behaviour, what is your recommendation to speed this up?
> 
> Best Regards,
> 
> Jerry
> 


Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by Jerry Lam <ch...@gmail.com>.
Hi Lars:

Thanks for confirming the inefficiency of the implementation for this case. In my case, a column can have more than 10K versions, so I need a quick way to stop the scan from digging into the column once there is a match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that notifies the framework to stop and go to the next column once the number of versions specified in setMaxVersions is met.

For now, I guess I have to hack it in the custom filter (i.e. keep the count myself)? If you have a better way to achieve this, please share :)
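
A rough, untested sketch of that "keep the count myself" hack, with made-up class and field names, might look like the following (same-era Filter API; the Writable methods are just the plumbing needed to ship a custom filter to the region servers):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionCappingFilter extends FilterBase {

  private int maxVersions;
  private byte[] currentQualifier;
  private int versionsKept;

  public VersionCappingFilter() {
    // required for deserialization on the region server
  }

  public VersionCappingFilter(int maxVersions) {
    this.maxVersions = maxVersions;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    byte[] qualifier = kv.getQualifier();
    if (currentQualifier == null || Bytes.compareTo(currentQualifier, qualifier) != 0) {
      // A new column: restart the per-column version count.
      currentQualifier = qualifier;
      versionsKept = 0;
    }
    if (versionsKept >= maxVersions) {
      // Enough versions of this column: jump straight to the next column.
      return ReturnCode.NEXT_COL;
    }
    versionsKept++;
    return ReturnCode.INCLUDE;
  }

  @Override
  public void reset() {
    // Called between rows.
    currentQualifier = null;
    versionsKept = 0;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(maxVersions);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    maxVersions = in.readInt();
  }
}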

Best Regards,

Jerry

Sent from my iPad (sorry for spelling mistakes)

On 2012-08-27, at 20:11, lars hofhansl <lh...@yahoo.com> wrote:

> Currently filters are evaluated before we do version counting.
> 
> Here's a comment from ScanQueryMatcher.java:
>     /**
>      * Filters should be checked before checking column trackers. If we do
>      * otherwise, as was previously being done, ColumnTracker may increment its
>      * counter for even that KV which may be discarded later on by Filter. This
>      * would lead to incorrect results in certain cases.
>      */
> 
> 
> So this is by design. (Doesn't mean it's correct or desirable, though.)
> 
> -- Lars
> 
> 
> ----- Original Message -----
> From: Jerry Lam <ch...@gmail.com>
> To: user <us...@hbase.apache.org>
> Cc: 
> Sent: Monday, August 27, 2012 2:40 PM
> Subject: setTimeRange and setMaxVersions seem to be inefficient
> 
> Hi HBase community:
> 
> I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> return per column. The behaviour is as I would expect that is
> setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of KV
> with timestamp that is less than or equal to T.
> However, I noticed that all versions of the KeyValue for a particular
> column are processed through a custom filter I implemented even though I
> specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if ONE
> KV of a particular column has ReturnCode.INCLUDE, the framework will jump
> to the next COL instead of iterating through all versions of the column.
> 
> Can someone confirm me if this is the expected behaviour (iterating through
> all versions of a column before setMaxVersions take effect)? If this is an
> expected behaviour, what is your recommendation to speed this up?
> 
> Best Regards,
> 
> Jerry
> 

Re: setTimeRange and setMaxVersions seem to be inefficient

Posted by lars hofhansl <lh...@yahoo.com>.
Currently filters are evaluated before we do version counting.

Here's a comment from ScanQueryMatcher.java:
    /**
     * Filters should be checked before checking column trackers. If we do
     * otherwise, as was previously being done, ColumnTracker may increment its
     * counter for even that KV which may be discarded later on by Filter. This
     * would lead to incorrect results in certain cases.
     */


So this is by design. (Doesn't mean it's correct or desirable, though.)
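
To make the consequence of that ordering concrete, here is a tiny self-contained toy simulation (plain Java, not HBase code; the numbers are invented) of one column with 5 versions, a filter that matches every version, and maxVersions = 1:

import java.util.Arrays;
import java.util.List;

public class FilterBeforeVersionCountingDemo {

  enum ReturnCode { INCLUDE, NEXT_COL }

  public static void main(String[] args) {
    // Five versions of a single column, newest first (timestamps only).
    List<Long> versions = Arrays.asList(500L, 400L, 300L, 200L, 100L);
    int maxVersions = 1;

    int filterCalls = 0;
    int returned = 0;
    for (long timestamp : versions) {
      filterCalls++;                       // the filter is consulted for every version...
      ReturnCode rc = ReturnCode.INCLUDE;  // ...and a prefix-style filter matches them all
      if (rc == ReturnCode.INCLUDE && returned < maxVersions) {
        returned++;                        // version counting only happens afterwards
      }
    }
    // Prints: filter consulted 5 times, 1 version(s) returned
    System.out.println("filter consulted " + filterCalls
        + " times, " + returned + " version(s) returned");
  }
}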

-- Lars


----- Original Message -----
From: Jerry Lam <ch...@gmail.com>
To: user <us...@hbase.apache.org>
Cc: 
Sent: Monday, August 27, 2012 2:40 PM
Subject: setTimeRange and setMaxVersions seem to be inefficient

Hi HBase community:

I tried to use setTimeRange and setMaxVersions to limit the number of KVs
return per column. The behaviour is as I would expect that is
setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of KV
with timestamp that is less than or equal to T.
However, I noticed that all versions of the KeyValue for a particular
column are processed through a custom filter I implemented even though I
specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that if ONE
KV of a particular column has ReturnCode.INCLUDE, the framework will jump
to the next COL instead of iterating through all versions of the column.

Can someone confirm me if this is the expected behaviour (iterating through
all versions of a column before setMaxVersions take effect)? If this is an
expected behaviour, what is your recommendation to speed this up?

Best Regards,

Jerry