Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/01/23 10:16:00 UTC
[jira] [Commented] (IMPALA-10296) Fix analytic limit pushdown when predicates are present

    [ https://issues.apache.org/jira/browse/IMPALA-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270610#comment-17270610 ] 

ASF subversion and git services commented on IMPALA-10296:
----------------------------------------------------------

Commit 1ada739e81416a6f49f2e8bb287a007252371444 in impala's branch refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1ada739 ]

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned come from the first partitions in sort
order. For example, if the limit is 100 and we go through the partitions
in order until their row counts add up to 100, then we know that the
returned rows must come from those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.
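
The failure mode can be reproduced with a small simulation. This is only
a sketch of the query shape, not Impala's planner or execution code: the
function name run_query, the (p, o) row layout and the toy data are
assumptions made for illustration. It models a row_number() <= 2
predicate with LIMIT 4, with and without a limit naively pushed below
the analytic.

```python
from itertools import groupby


def run_query(rows, partition_limit, limit, pushed_limit=None):
    """Simulate, for rows given as (p, o) tuples:

    SELECT p, o FROM (
      SELECT p, o, row_number() OVER (PARTITION BY p ORDER BY o) rn FROM t) v
    WHERE rn <= partition_limit ORDER BY p, o LIMIT limit

    If pushed_limit is set, the sort feeding the analytic is replaced by a
    top-n that keeps only the first pushed_limit rows.
    """
    ordered = sorted(rows)  # sort by (p, o) for the analytic
    if pushed_limit is not None:
        ordered = ordered[:pushed_limit]
    out = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        for rn, row in enumerate(group, start=1):
            if rn <= partition_limit:  # the row_number() predicate
                out.append(row)
    return out[:limit]


# Two partitions of five rows each.
rows = [("a", i) for i in range(1, 6)] + [("b", i) for i in range(1, 6)]

correct = run_query(rows, partition_limit=2, limit=4)
naive = run_query(rows, partition_limit=2, limit=4, pushed_limit=4)
print(correct)  # two rows from each partition
print(naive)    # only the two rows from partition "a" survive
```

With the naive pushdown the top-n keeps four rows of partition "a", the
predicate then discards two of them, and the rows from partition "b"
that the query should return are never seen.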

We can avoid the problem in the case where the partition limit >=
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first <limit> rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. For example, with a
row_number() <= 100 predicate and limit 100, pushing down
limit 200 guarantees that all of the rows in the final partition
are captured. One case we need to handle is that, with a rank()
predicate, a partition can contain more than that number of
qualifying rows because of ties.
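
The effect of the widened limit can be checked with a toy model. This is
a sketch under assumptions: the pushed-down limit is taken to be the
order by limit plus the partition limit, which matches the 100 + 100 =
200 example above but may not be the exact calculation in the patch, and
the helper name run_with_pushdown and the data are hypothetical. The
outer query orders by (p, o DESC) while the analytic orders by (p, o),
so the orderings differ within a partition.

```python
from itertools import groupby


def run_with_pushdown(rows, partition_limit, limit, pushed_limit=None):
    """rows are (p, o) tuples; the analytic is
    row_number() OVER (PARTITION BY p ORDER BY o) and the outer query is
    ORDER BY p, o DESC LIMIT limit."""
    ordered = sorted(rows)  # analytic sort order: (p, o)
    if pushed_limit is not None:
        ordered = ordered[:pushed_limit]  # top-n below the analytic
    kept = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        kept.extend(row for rn, row in enumerate(group, start=1)
                    if rn <= partition_limit)
    kept.sort(key=lambda r: (r[0], -r[1]))  # outer ORDER BY p, o DESC
    return kept[:limit]


# Partition "a" has 2 rows, partition "b" has 10 rows.
rows = [("a", 1), ("a", 2)] + [("b", i) for i in range(1, 11)]
k, n = 3, 3  # partition limit >= order by limit

expected = run_with_pushdown(rows, k, n)
naive = run_with_pushdown(rows, k, n, pushed_limit=n)
widened = run_with_pushdown(rows, k, n, pushed_limit=n + k)
print(expected, naive, widened)
```

Here the naive limit of 3 cuts partition "b" off after its first row, so
the outer ordering picks the wrong row from it. With the widened limit
every row of the final partition that passes the predicate reaches the
outer sort, and the result matches the plan without pushdown.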

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).
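
As an illustration of what tie handling means for a top-n, the sketch
below returns the first n rows in sort order plus any further rows that
compare equal to the n-th row. It is sort-based for brevity and is not
the backend implementation; the function name top_n_with_ties is
hypothetical.

```python
def top_n_with_ties(rows, n, key=lambda r: r):
    """Return the first n rows by key, plus all rows tied with the n-th."""
    if n <= 0:
        return []
    ordered = sorted(rows, key=key)
    if len(ordered) <= n:
        return ordered
    last_key = key(ordered[n - 1])
    end = n
    # Extend past n while the following rows tie with the n-th row.
    while end < len(ordered) and key(ordered[end]) == last_key:
        end += 1
    return ordered[:end]


result = top_n_with_ties([5, 1, 3, 3, 3, 2], 3)
print(result)
```

In this example the third row in sort order has the value 3, so the two
other rows with value 3 are returned as well, giving five rows for a
limit of 3.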

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitrarily large value.
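
A guard of this kind could look roughly like the sketch below. Everything
here is an assumption made for illustration: the function name
choose_pushed_limit, the row-size estimate, the comparison of estimated
bytes against the TOPN_BYTES_LIMIT value, and the byte values used in the
example calls. The planner's actual logic may differ.

```python
def choose_pushed_limit(order_by_limit, partition_limit,
                        est_row_bytes, topn_bytes_limit):
    """Return the limit to push below the analytic, or None to skip the
    optimization. Hypothetical sketch, not the planner's real code."""
    if partition_limit < order_by_limit:
        return None  # the transformation is only safe when partition limit >= limit
    pushed = order_by_limit + partition_limit  # assumed from the 100 + 100 example
    if pushed * est_row_bytes > topn_bytes_limit:
        return None  # the widened top-n would be estimated to use too much memory
    return pushed


cap = 512 * 1024 * 1024  # example value only
print(choose_pushed_limit(100, 100, 8, cap))       # small limits fit under the cap
print(choose_pushed_limit(100, 50, 8, cap))        # partition limit < limit
print(choose_pushed_limit(100, 10**9, 100, cap))   # very high rank predicate
```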

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for the
  predicates <=, <, = and for row_number().
* Add some end-to-end tests that reproduce bugs covered by this JIRA
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Reviewed-on: http://gerrit.cloudera.org:8080/16942
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Fix analytic limit pushdown when predicates are present
> -------------------------------------------------------
>
>                 Key: IMPALA-10296
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10296
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Blocker
>              Labels: correctness
>
> This is to fix case 1 of the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
