You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2018/04/04 06:05:10 UTC

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Andrew Wong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9923


Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................

WIP KUDU-16 pt 3: add per-scan-token limits

This patch implements scan-limiting on a per-token basis. Kudu only
exposes a builder API for batches of tokens. These tokens get sent out
and instantiate scanners with any specified limit. This means that if
the builder limit is l, with t tokens, the number of rows returned will
be at most l * t. This is preferable to splitting the limit across
tokens, e.g. by limiting each token to l / t rows, as this may yield
fewer than l rows across tokens, when more than l rows actually exist,
which doesn't seem desirable.

WIP Maybe we should expose setting individual token limits. Also I'm not
sure this is actually a useful API or the one that we want to expose.

Change-Id: I655f07f10a9a99e9766402729361de97e929136a
---
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/scan_token-internal.cc
M src/kudu/client/scan_token-test.cc
4 files changed, 37 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/9923/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 )

Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................


Patch Set 1:

Revisiting this. I was a bit surprised to see the limit used in the Java client, but not in the C++ client. Given the limit is already defined and used on the scan token it seams like plumbing it through cant hurt and would make the client implementations more consistent.


-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 22 Jun 2020 15:23:51 +0000
Gerrit-HasComments: No

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 )

Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................


Patch Set 1:

(1 comment)

Code LGTM, but I could go either way on whether to include this at all.

http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG@19
PS1, Line 19: sure this is actually a useful API or the one that we want to expose.
> yea, Im not sure this API provides much value beyond the caller just specif
We chatted about this on Slack yesterday, but my thoughts are that we should definitely support setting a limit after re-hydrating the token, but providing this API as well might be useful, and it seems like it will be low maintenance overhead.  I checked out the Spark integration to see if we could take advantage of it their, but it doesn't look like the RelationProvider interface exposes any kind of Limit hooks, so I guess I don't have any concrete examples where it will be useful.



-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 05 Apr 2018 18:09:53 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 )

Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG@19
PS1, Line 19: sure this is actually a useful API or the one that we want to expose.
> We chatted about this on Slack yesterday, but my thoughts are that we shoul
I agree with both of your points; if a token is a logical concept that represents some optimization on a query plan, limiting probably doesn't make much sense (unless we're doing something fancy like sampling and joining, but AFAIK we're [Impala/Spark] not). That said, as an optimization for how Impala or some other parallel-scanning service might handles limits, it still might be a convenient API to have.

I took a look at Spark and it seems like limiting is handled at a different layer of abstraction than the one at which KuduRDD sits (something like https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/sql/execution/Limit.html). I'll have to dig a bit more, still not sure how KuduRDD fits in.

As for Impala, yeah it'd be helpful to get some feedback from the Impala devs. I pinged Thomas since he worked on runtime filters and runtime limits seems in a similar ballpark. In the meantime, I've filed IMPALA-6821 and pointed it at this CR.



-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 07 Apr 2018 06:58:30 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 )

Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG@19
PS1, Line 19: sure this is actually a useful API or the one that we want to expose.
yea, Im not sure this API provides much value beyond the caller just specifying the limit when they hydrate the token before they call Open() on the scan, which has the same net effect.

In general I feel like the APIs we provide for scan tokens should be things that offer some optimization opportunity at that "planning" stage. For example, it's beneficial to express predicates because it may reduce the number of generated tokens based on partition pruning, but for something like limit, especially given the odd semantics that you described, it's less obvious. A further reason it might be better for the user to handle the limits is that, at least in the case of Impala, a single backend is responsible for processing several tokens (in parallel, even), and so they'll still need to do their own limit tracking there to early-stop scanning after the limit is exhausted.

Maybe worth chatting with the Impala folks to see whether they think they would prefer to use an API like this to attach the limit to the scan tokens or whether or would be more useful for them to manage the limit themselves?



-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:23:10 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has abandoned this change. ( http://gerrit.cloudera.org:8080/9923 )

Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................


Abandoned

Deemed not that useful
-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 )

Change subject: WIP KUDU-16 pt 3: add per-scan-token limits
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> (1 comment)

After some discussion with Thomas from the Impala team, I think the consensus is that this API isn't that useful. Right now, Impala can't really take advantage of this top-level LimitPerToken API because the scan limit isn't known at the time of token building (although there are likely ways around this through updating Impala's source).

The ideal API would be one where a limit could be set per grouop of tokens and the tokens would be able to coordinate with each other to satisfy the exact limit. This is difficult to do, so it's probably off the table for now.

That said, the limit is definitely known during the re-hydrating call DeserializeIntoScanner(), so Impala should be able to use the API defined in [725c6ce6391b42331e4ab4210ae4117779cba4a2](https://github.com/apache/kudu/commit/725c6ce6391b42331e4ab4210ae4117779cba4a2) in a similar way to that in this patch, i.e. specifying the overall limit per scanner such that l * t rows are returned across all scanners and returning the first l, since this is still a perf win.


-- 
To view, visit http://gerrit.cloudera.org:8080/9923
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a
Gerrit-Change-Number: 9923
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 09 Apr 2018 21:43:58 +0000
Gerrit-HasComments: No