You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2019/07/09 05:40:00 UTC
[jira] [Commented] (CALCITE-3181) Support limit per group in Window

    [ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880945#comment-16880945 ] 

Julian Hyde commented on CALCITE-3181:
--------------------------------------

It seems to make sense. Can you devise an example on a "real" data set, and involving not just ROW_NUMBER but an aggregate function such as AVG or SUM.

I know it is not standard SQL, but can you sketch out how the query would look if we allowed "OVER (PARTITION BY ... ORDER BY ... FETCH ...)"?

A while ago I was thinking about "partitioned sort-limit". We support "Sort(Scan(EMP), key=[salary DESC], fetch=10)" (the 10 top-earning employees) but we don't support "Sort(Scan(EMP), partition=[deptno], key=[salary DESC], fetch=10)" (the top 10 top-earning employees in each department).

This came up in CALCITE-1317, the "WinMagic" optimization. Regular Sort with limit can handle non-correlated sub-queries, but for correlated sub-queries we needed partitioned Sort with limit. Thus it seems a natural and useful extension to Sort. I haven't thought about whether the extension of Window that you propose is as natural.

> Support limit per group in Window
> ---------------------------------
>
>                 Key: CALCITE-3181
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3181
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haisheng Yuan
>            Priority: Major
>
> We have a lot of queries like the following to retrieve top N tuples per group:
> {code:java}
> SELECT x, y FROM
>      (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) 
>      AS rn FROM t1) t2 WHERE rn <= 3;
> {code}
> The performance is not good if each group has a lot more tuples than wanted, because we will retrieve and sort all the tuples, instead of just doing a top-N heap sort.
> In order to do optimization for this kind of query, we need to extend window to support limit, if and only if there is only 1 window function, and it is {{row_number()}}. We also need a substitute rule to push the limit into window. Of course, we also need to modify executor to support this optimization (can be later).
> {code:java}
> Filter (rn <= 3)
>   +- Window (window#0={Partition by x order by y ROW_NUMBER()})
> {code}
> to
> {code:java}
> Filter (rn <= 3)
>   +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()})
> {code}
> Thoughts? Objections?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)