Posted to dev@hbase.apache.org by Dave Latham <la...@davelink.net> on 2015/04/09 06:09:22 UTC

remove scanner caching?

After debugging an issue with scans missing data while migrating to 0.98 (thanks
Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
have both caching and maxResultSize for scans.  It seems to be more client
API complexity than it's worth.  Why would someone need to set caching when
maxResultSize is available?  Indeed, the first patch proposed by some
fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
deprecate and eventually remove caching?  Is there a good case for keeping
it in the client API surface?
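To make the overlap concrete, here is a minimal simulation (not HBase code; the numbers and the method are made up) of how the two knobs bound the same thing: "caching" caps rows per RPC chunk, "maxResultSize" caps bytes per RPC chunk, and whichever trips first ends the chunk - which is why having both feels redundant.

```java
// Illustrative simulation only: two knobs bounding one RPC chunk.
public class ChunkingDemo {
    /** Returns how many rows fit in one chunk before either cap trips. */
    static int rowsPerChunk(int[] rowSizes, int caching, long maxResultSize) {
        int rows = 0;
        long bytes = 0;
        for (int size : rowSizes) {
            // Stop as soon as either the row cap or the byte cap is reached.
            if (rows >= caching || bytes >= maxResultSize) break;
            rows++;
            bytes += size;
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] rows = {100, 100, 100, 100, 100};
        // Row cap trips first: 3 rows despite a huge byte budget.
        System.out.println(rowsPerChunk(rows, 3, 1_000_000));
        // Byte cap trips first: the 100-byte budget is full after 1 row.
        System.out.println(rowsPerChunk(rows, 100, 100));
    }
}
```

Either cap alone would have bounded the chunk; that is the redundancy being questioned.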

Dave

Re: remove scanner caching?

Posted by Dave Latham <la...@davelink.net>.
Yes also on keeping the default configuration (rowLimit =
Integer.MAX_VALUE, bufferSize = 2MB, allowPartialResults = false).

Re: remove scanner caching?

Posted by Dave Latham <la...@davelink.net>.
Sounds like you and others are already ahead of me.  Thanks for
opening HBASE-13441 and your related work.  Some responses below:

> Nice idea! I agree that the Scan API would be cleaned up by your
> suggestions, especially the doc updates. Some comments below:
>
> > Scan.bufferSize (instead of maxResultSize for the target over-the-wire
> > size - though this is still confusing because it's common to go over this
> > size)
> Ya this setting will always have a little ambiguity associated with it (at
> least until such a time where we are able to enforce it at the byte level
> i.e. send back partial cells). Scan.bufferSize sounds okay. As a note,
> there was some discussion in HBASE-11544 about renaming this field and one
> of the recommendations was Scan.rpcChunkSize.

rpcChunkSize sounds fine to me too - much better than maxResultSize

> > Scan.limitRows (instead of caching - along with true client side support)
> Makes sense. I think that client side support is actually already there (at
> least it is in ClientScanner via the countdown variable that is used as the
> caching value for new scanner callables).

Gotcha - but I would envision the client actually closing the scanner
(Iterable<Result>) once the row limit is hit, changing the meaning
from a detail of how the data transfer is implemented to an actual,
visible query limit.
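A toy sketch of that envisioned behavior (hypothetical - this is not the HBase ClientScanner): a wrapper that enforces a visible row limit and closes the underlying scanner the moment the limit is reached, rather than merely sizing RPC batches.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a client-side row limit that closes the scanner when hit.
public class LimitedScanner implements Iterator<String>, AutoCloseable {
    private final Iterator<String> inner;
    private int remaining;
    private boolean closed = false;

    LimitedScanner(Iterator<String> inner, int limitRows) {
        this.inner = inner;
        this.remaining = limitRows;
    }

    public boolean hasNext() {
        if (remaining <= 0) {
            close();          // limit hit: release the scanner eagerly
            return false;
        }
        return inner.hasNext();
    }

    public String next() {
        remaining--;
        return inner.next();
    }

    public void close() { closed = true; } // real code would release server resources

    public boolean isClosed() { return closed; }

    public static void main(String[] args) {
        LimitedScanner s = new LimitedScanner(List.of("r1", "r2", "r3", "r4").iterator(), 2);
        int count = 0;
        while (s.hasNext()) { s.next(); count++; }
        System.out.println(count + " closed=" + s.isClosed());
    }
}
```

The point of the sketch is the eager close in hasNext(): the limit becomes part of the query's observable semantics, not an RPC sizing hint.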

>
> > Scan.allowPartialResults  (to indicate it's ok to break up rows across
> Results...)
> With HBASE-11544 in branch-1+ the server will stop adding Cells as soon as
> the buffer fills and send back the accumulated Results to the client (last
> Result may be a partial of its row). In the case that allow partial results
> is false, the ClientScanner handles reassembling the partials into a
> complete view of the row before releasing the Result to the application.

That's awesome.  Great work.

> With this proposed cleanup, are you recommending that we do away with
> Scan.setBatch? Would the default configuration remain as it is now in
> branch-1+ (rowLimit = Integer.MAX_VALUE, bufferSize = 2MB,
> allowPartialResults = false)?

Yes, I was thinking of dropping setBatch also.

Re: remove scanner caching?

Posted by Jonathan Lawlor <jo...@cloudera.com>.
Nice idea! I agree that the Scan API would be cleaned up by your
suggestions, especially the doc updates. Some comments below:

> Scan.bufferSize (instead of maxResultSize for the target over-the-wire
size - though this is still confusing because it's common to go over this
size)
Ya this setting will always have a little ambiguity associated with it (at
least until such a time where we are able to enforce it at the byte level
i.e. send back partial cells). Scan.bufferSize sounds okay. As a note,
there was some discussion in HBASE-11544 about renaming this field and one
of the recommendations was Scan.rpcChunkSize.

> Scan.limitRows (instead of caching - along with true client side support)
Makes sense. I think that client side support is actually already there (at
least it is in ClientScanner via the countdown variable that is used as the
caching value for new scanner callables).

> Scan.allowPartialResults  (to indicate it's ok to break up rows across
Results...)
With HBASE-11544 in branch-1+ the server will stop adding Cells as soon as
the buffer fills and send back the accumulated Results to the client (last
Result may be a partial of its row). In the case that allow partial results
is false, the ClientScanner handles reassembling the partials into a
complete view of the row before releasing the Result to the application.
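A toy model of that reassembly (hypothetical - not the actual ClientScanner code): the server may cut a row when its buffer fills, so a "partial" Result carries only some of a row's Cells, and with allow-partial-results off the client stitches consecutive partials back together before releasing the row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of client-side reassembly of partial results.
public class Reassembler {
    /** A server result: a row key, some cells, and a partial flag. */
    record Part(String row, List<String> cells, boolean partial) {}

    static List<List<String>> reassemble(List<Part> parts) {
        List<List<String>> complete = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (Part p : parts) {
            pending.addAll(p.cells());
            if (!p.partial()) {            // row is now whole; release it
                complete.add(pending);
                pending = new ArrayList<>();
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        List<List<String>> rows = reassemble(List.of(
            new Part("r1", List.of("c1", "c2"), true),  // buffer filled mid-row
            new Part("r1", List.of("c3"), false),       // rest of r1
            new Part("r2", List.of("c1"), false)));
        System.out.println(rows);
    }
}
```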

With this proposed cleanup, are you recommending that we do away with
Scan.setBatch? Would the default configuration remain as it is now in
branch-1+ (rowLimit = Integer.MAX_VALUE, bufferSize = 2MB,
allowPartialResults = false)?

Jonathan

On Thu, Apr 9, 2015 at 6:25 AM, Dave Latham <la...@davelink.net> wrote:

> We definitely wouldn't want to remove it too soon, for compatibility
> reasons.
>
> Adding a "limitRows" notion sounds reasonable, but I'd argue is actually
> something different than caching was.  If an app is relying on the scanner
> to actually limit the number of rows returned, the current caching limit
> won't work for scans that cross region boundaries.  We would need to keep
> the state client side in addition to server side and decrement as we
> traverse regions.
>
> Looking into the branch-1 API, I can see there is now also
> Scan.allowPartialResults in addition to Scan.batch.  For most cases, I'd
> expect batching is just to avoid memory issues for wide rows, in which case
> allowPartialResults could be a better, simpler interface to tell HBase not
> to overflow a small buffer with wide rows.  Though it looks like at the
> moment that doesn't happen.
>
> As an app developer myself, the interaction between batch, caching,
> maxResultSize, allowPartialResults is confusing (as well as the similarly
> named cacheBlocks).  The names could be better (maxResultSize has actually
> nothing to do with the maximum size of the Result's returned - it's rather
> the internal buffer size, batch is a max Cells per Result [I think.  does
> it reset between rows?], and of course caching is an internal
> maxRowsPerRPC).  The documentation is limited and out of date (e.g.
> https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some
> things
> could at least be more consistent (getAllowPartialResults instead of
> isAllowPartialResults like almost all the other boolean properties).
>
> As I poke around and write this out, I guess I'd argue instead that it's
> time (or past time) to clean up the Scan API and document it more clearly.
> Which is a scary task I know.  But for a newcomer, it's a scary API right
> now.
>
> What about something like:
> Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
> - though this is still confusing because it's common to go over this size)
> Scan.limitRows (instead of caching - along with true client side support)
> Scan.allowPartialResults (to indicate it's ok to break up rows across
> Results. it is transmitted to the server to indicate stop adding Cells to
> the buffer as soon as it fills rather than at the end of the row.  if a
> client needs true pagination for Cells within a row it can be done with a
> Filter.)
> Scan.cacheBlocks (less confusing without other things called "caching")
>
> Dave
>
>
>
>
>
> On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you
> leave
> > it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> > that you'll only look at the first N rows returned. In that case you'd
> set
> > it to N. I thought we had renamed it from "caching" to "limit" but
> looking
> > at the code, that is not the case.
> >
> > In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> > backwards compatibility.
> >
> > -- Lars
> >       From: Dave Latham <la...@davelink.net>
> >  To: dev@hbase.apache.org
> >  Sent: Wednesday, April 8, 2015 9:09 PM
> >  Subject: remove scanner caching?
> >
> > After debugging a scans missing data issue while migrating to 0.98
> (thanks
> > Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> > have both caching and maxResultSize for scans.  It seems to be more
> client
> > api complexity than it's worth.  Why would someone need to set caching
> when
> > maxResultSize is available?  Indeed, the first patch proposed by some
> > fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
> > deprecate and eventually remove caching?  Is there a good case for
> keeping
> > it in the client API surface?
> >
> > Dave
> >
> >
> >
> >
>

Re: remove scanner caching?

Posted by Stack <st...@duboce.net>.
On Thu, Apr 9, 2015 at 6:25 AM, Dave Latham <la...@davelink.net> wrote:

> We definitely wouldn't want to remove it too soon, for compatibility
> reasons.
>
> Adding a "limitRows" notion sounds reasonable, but I'd argue is actually
> something different than caching was.  If an app is relying on the scanner
> to actually limit the number of rows returned, the current caching limit
> won't work for scans that cross region boundaries.  We would need to keep
> the state client side in addition to server side and decrement as we
> traverse regions.
>
> Looking into the branch-1 API, I can see there is now also
> Scan.allowPartialResults in addition to Scan.batch.  For most cases, I'd
> expect batching is just to avoid memory issues for wide rows, in which case
> allowPartialResults could be a better, simpler interface to tell HBase not
> to overflow a small buffer with wide rows.  Though it looks like at the
> moment that doesn't happen.
>
> As an app developer myself, the interaction between batch, caching,
> maxResultSize, allowPartialResults is confusing (as well as the similarly
> named cacheBlocks).  The names could be better (maxResultSize has actually
> nothing to do with the maximum size of the Result's returned - it's rather
> the internal buffer size, batch is a max Cells per Result [I think.  does
> it reset between rows?], and of course caching is an internal
> maxRowsPerRPC).  The documentation is limited and out of date (e.g.
> https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some
> things
> could at least be more consistent (getAllowPartialResults instead of
> isAllowPartialResults like almost all the other boolean properties).
>
> As I poke around and write this out, I guess I'd argue instead that it's
> time (or past time) to clean up the Scan API and document it more clearly.
> Which is a scary task I know.  But for a newcomer, it's a scary API right
> now.
>
> What about something like:
> Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
> - though this is still confusing because it's common to go over this size)
> Scan.limitRows (instead of caching - along with true client side support)
> Scan.allowPartialResults (to indicate it's ok to break up rows across
> Results. it is transmitted to the server to indicate stop adding Cells to
> the buffer as soon as it fills rather than at the end of the row.  if a
> client needs true pagination for Cells within a row it can be done with a
> Filter.)
> Scan.cacheBlocks (less confusing without other things called "caching")
>
>
I think you've identified next logical follow-on to the work that
JonathanL, JoshE, Lars et al. have been at in Scanners recently. A Scanner
2.0 project sounds good to me (with appropriate bridging from old API
through to the new -- configs and docs too).

St.Ack




> Dave
>
>
>
>
>
> On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you
> leave
> > it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> > that you'll only look at the first N rows returned. In that case you'd
> set
> > it to N. I thought we had renamed it from "caching" to "limit" but
> looking
> > at the code, that is not the case.
> >
> > In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> > backwards compatibility.
> >
> > -- Lars
> >       From: Dave Latham <la...@davelink.net>
> >  To: dev@hbase.apache.org
> >  Sent: Wednesday, April 8, 2015 9:09 PM
> >  Subject: remove scanner caching?
> >
> > After debugging a scans missing data issue while migrating to 0.98
> (thanks
> > Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> > have both caching and maxResultSize for scans.  It seems to be more
> client
> > api complexity than it's worth.  Why would someone need to set caching
> when
> > maxResultSize is available?  Indeed, the first patch proposed by some
> > fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
> > deprecate and eventually remove caching?  Is there a good case for
> keeping
> > it in the client API surface?
> >
> > Dave
> >
> >
> >
> >
>

Re: remove scanner caching?

Posted by Dave Latham <la...@davelink.net>.
We definitely wouldn't want to remove it too soon, for compatibility
reasons.

Adding a "limitRows" notion sounds reasonable, but I'd argue is actually
something different than caching was.  If an app is relying on the scanner
to actually limit the number of rows returned, the current caching limit
won't work for scans that cross region boundaries.  We would need to keep
the state client side in addition to server side and decrement as we
traverse regions.
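The bookkeeping a true row limit needs could look like this (an illustrative sketch, not HBase internals): the client hands each per-region scanner only the rows it still wants and decrements the remaining count as it crosses region boundaries; a purely per-region cap would reset at each boundary and return too many rows.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: decrementing a row limit across region boundaries.
public class CrossRegionLimit {
    static List<String> scanWithLimit(List<List<String>> regions, int limitRows) {
        List<String> out = new ArrayList<>();
        int remaining = limitRows;
        for (List<String> region : regions) {
            if (remaining <= 0) break;
            // Each per-region scan is capped at the rows still wanted overall.
            int take = Math.min(remaining, region.size());
            out.addAll(region.subList(0, take));
            remaining -= take;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> regions =
            List.of(List.of("a", "b"), List.of("c", "d"), List.of("e"));
        // A limit of 3 spans the first region boundary but stops mid-second-region.
        System.out.println(scanWithLimit(regions, 3));
    }
}
```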

Looking into the branch-1 API, I can see there is now also
Scan.allowPartialResults in addition to Scan.batch.  For most cases, I'd
expect batching is just to avoid memory issues for wide rows, in which case
allowPartialResults could be a better, simpler interface to tell HBase not
to overflow a small buffer with wide rows.  Though it looks like at the
moment that doesn't happen.

As an app developer myself, the interaction between batch, caching,
maxResultSize, and allowPartialResults is confusing (as well as the similarly
named cacheBlocks).  The names could be better: maxResultSize actually has
nothing to do with the maximum size of the Results returned - it's rather
the internal buffer size; batch is a max Cells per Result [I think - does
it reset between rows?]; and of course caching is an internal
maxRowsPerRPC.  The documentation is limited and out of date (e.g.
https://hbase.apache.org/book.html#perf.hbase.client.caching).  Some things
could at least be more consistent (getAllowPartialResults instead of
isAllowPartialResults like almost all the other boolean properties).

As I poke around and write this out, I guess I'd argue instead that it's
time (or past time) to clean up the Scan API and document it more clearly.
Which is a scary task I know.  But for a newcomer, it's a scary API right
now.

What about something like:
Scan.bufferSize (instead of maxResultSize for the target over-the-wire size
- though this is still confusing because it's common to go over this size)
Scan.limitRows (instead of caching - along with true client side support)
Scan.allowPartialResults (to indicate it's ok to break up rows across
Results. it is transmitted to the server to indicate stop adding Cells to
the buffer as soon as it fills rather than at the end of the row.  if a
client needs true pagination for Cells within a row it can be done with a
Filter.)
Scan.cacheBlocks (less confusing without other things called "caching")
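As a sketch only, the proposed surface and its defaults could read like this. None of these names exist in HBase (ScanSpec is invented here, and the builder methods are hypothetical); the snippet just shows how the four knobs might hang together.

```java
// Hypothetical API sketch of the proposed cleanup -- not real HBase code.
public class ScanSpec {
    long bufferSize = 2L * 1024 * 1024;   // target over-the-wire chunk size (~2MB)
    int limitRows = Integer.MAX_VALUE;    // visible query limit, enforced client side
    boolean allowPartialResults = false;  // ok to split a row across Results?
    boolean cacheBlocks = true;           // server block-cache hint

    ScanSpec bufferSize(long b) { this.bufferSize = b; return this; }
    ScanSpec limitRows(int n) { this.limitRows = n; return this; }
    ScanSpec allowPartialResults(boolean v) { this.allowPartialResults = v; return this; }
    ScanSpec cacheBlocks(boolean v) { this.cacheBlocks = v; return this; }

    public static void main(String[] args) {
        // Only deviations from the defaults need to be stated.
        ScanSpec spec = new ScanSpec().limitRows(500).allowPartialResults(true);
        System.out.println(spec.limitRows + " " + spec.allowPartialResults);
    }
}
```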

Dave





On Wed, Apr 8, 2015 at 10:00 PM, lars hofhansl <la...@apache.org> wrote:

> Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave
> it disabled (the default of Long.MAX_VALUE) unless you know ahead of time
> that you'll only look at the first N rows returned. In that case you'd set
> it to N. I thought we had renamed it from "caching" to "limit" but looking
> at the code, that is not the case.
>
> In 0.98 and 1.0.x we need to keep it around defaulting to 100 for
> backwards compatibility.
>
> -- Lars
>       From: Dave Latham <la...@davelink.net>
>  To: dev@hbase.apache.org
>  Sent: Wednesday, April 8, 2015 9:09 PM
>  Subject: remove scanner caching?
>
> After debugging a scans missing data issue while migrating to 0.98 (thanks
> Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
> have both caching and maxResultSize for scans.  It seems to be more client
> api complexity than it's worth.  Why would someone need to set caching when
> maxResultSize is available?  Indeed, the first patch proposed by some
> fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
> deprecate and eventually remove caching?  Is there a good case for keeping
> it in the client API surface?
>
> Dave
>
>
>
>

Re: remove scanner caching?

Posted by lars hofhansl <la...@apache.org>.
Scanner caching (in 1.1 and 2.0) is now a _limit_. I.e. normally you leave it disabled (the default of Long.MAX_VALUE) unless you know ahead of time that you'll only look at the first N rows returned. In that case you'd set it to N. I thought we had renamed it from "caching" to "limit" but looking at the code, that is not the case.

In 0.98 and 1.0.x we need to keep it around defaulting to 100 for backwards compatibility.

-- Lars
      From: Dave Latham <la...@davelink.net>
 To: dev@hbase.apache.org 
 Sent: Wednesday, April 8, 2015 9:09 PM
 Subject: remove scanner caching?
   
After debugging a scans missing data issue while migrating to 0.98 (thanks
Andrew, Jonathon, Josh, and Lars for the help), I'm left wondering why we
have both caching and maxResultSize for scans.  It seems to be more client
api complexity than it's worth.  Why would someone need to set caching when
maxResultSize is available?  Indeed, the first patch proposed by some
fellow in HBASE-1996 simply replaced caching with maxResultSize.  Can we
deprecate and eventually remove caching?  Is there a good case for keeping
it in the client API surface?

Dave