You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2021/06/21 17:55:00 UTC

[jira] [Commented] (IMPALA-10755) Wrong results for a query with predicate on an analytic function

    [ https://issues.apache.org/jira/browse/IMPALA-10755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366761#comment-17366761 ] 

Aman Sinha commented on IMPALA-10755:
-------------------------------------

As part of the changes in IMPALA-9979 (part 2 partitioned top-n) we made changes to push down predicates that reference analytic tuple into the inline view [1]. The regression got introduced because in the above example, the BinaryPredicate nr_pvp = max_nr_pvp is actually referencing 2 tuple ids, one of which (the right side) is the analytic tuple. In this case, the pushdown logic does not detect it and pushes into the inline view and removes the predicate from the unassigned conjuncts list thinking the analytic planner will take care of it. However, the analytic planner does not find the unassigned conjunct likely because it involves a non analytic tuple id (I need to debug this part a little more.  This is based on my initial investigation).  I have a potential fix that I am testing. 

[1] https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1333

> Wrong results for a query with predicate on an analytic function
> ----------------------------------------------------------------
>
>                 Key: IMPALA-10755
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10755
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Critical
>              Labels: correctness, wrongresults
>
> In the master branch, the following query with equality predicate on analytic function produces 13 rows whereas the correct result is 3 rows (as verified with Hive):
> {noformat}
> use tpch;
> select *
>   FROM (
>     SELECT
>       l_quantity, l_linenumber,
>       nr_pvp,
>       max(nr_pvp) over (partition BY l_quantity) max_nr_pvp
>     FROM (
>       SELECT
>         l_quantity, l_linenumber,
>         count(1) AS nr_pvp
>       FROM
>         lineitem, partsupp
>         where l_partkey	= ps_partkey
>         and ps_suppkey = 10 and  l_quantity in (1, 2)
>         GROUP BY l_quantity, l_linenumber
>     ) a
>   ) b
>   WHERE nr_pvp = max_nr_pvp
>   order by 1, 2, 3, 4;
> {noformat}
> Wrong results:
> {noformat}
> +------------+--------------+--------+------------+                                                      
> | l_quantity | l_linenumber | nr_pvp | max_nr_pvp |
> +------------+--------------+--------+------------+
> | 1.00       | 1            | 11     | 12         |
> | 1.00       | 2            | 12     | 12         |
> | 1.00       | 3            | 10     | 12         |
> | 1.00       | 4            | 5      | 12         |
> | 1.00       | 5            | 4      | 12         |
> | 1.00       | 6            | 5      | 12         |
> | 1.00       | 7            | 2      | 12         |
> | 2.00       | 1            | 12     | 12         |
> | 2.00       | 2            | 12     | 12         |
> | 2.00       | 3            | 11     | 12         |
> | 2.00       | 4            | 8      | 12         |
> | 2.00       | 5            | 3      | 12         |
> | 2.00       | 6            | 2      | 12         |
> +------------+--------------+--------+------------+
> Fetched 13 row(s) in 1.36s
> {noformat}
> Expected results (verified with Hive):
> {noformat}
> +---------------+-----------------+-----------+---------------+
> | b.l_quantity  | b.l_linenumber  | b.nr_pvp  | b.max_nr_pvp  |
> +---------------+-----------------+-----------+---------------+
> | 1.00          | 2               | 12        | 12            |
> | 2.00          | 1               | 12        | 12            |
> | 2.00          | 2               | 12        | 12            |
> +---------------+-----------------+-----------+---------------+
> {noformat}
> Explain plan for Impala shows that the equality predicate 'nr_pvp = max_nr_pvp' which is supposed to be applied on top of the analytic operator is missing:
> {noformat}
> +---------------------------------------------------------------------------+
> | Explain String                                                            |
> +---------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=42.88MB Threads=7               |
> | Per-Host Resource Estimates: Memory=483MB                                 |
> |                                                                           |
> | PLAN-ROOT SINK                                                            |
> | |                                                                         |
> | 11:MERGING-EXCHANGE [UNPARTITIONED]                                       |
> | |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
> | |                                                                         |
> | 06:SORT                                                                   |
> | |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
> | |  row-size=28B cardinality=357                                           |
> | |                                                                         |
> | 05:ANALYTIC                                                               |
> | |  functions: max(count(*))                                               |
> | |  partition by: l_quantity                                               |
> | |  row-size=28B cardinality=357                                           |
> | |                                                                         |
> | 04:SORT                                                                   |
> | |  order by: l_quantity ASC NULLS LAST                                    |
> | |  row-size=20B cardinality=357                                           |
> | |                                                                         |
> | 10:EXCHANGE [HASH(l_quantity)]                                            |
> | |                                                                         |
> | 09:AGGREGATE [FINALIZE]                                                   |
> | |  output: count:merge(*)                                                 |
> | |  group by: l_quantity, l_linenumber                                     |
> | |  row-size=20B cardinality=357                                           |
> | |                                                                         |
> | 08:EXCHANGE [HASH(l_quantity,l_linenumber)]                               |
> | |                                                                         |
> | 03:AGGREGATE [STREAMING]                                                  |
> | |  output: count(*)                                                       |
> | |  group by: l_quantity, l_linenumber                                     |
> | |  row-size=20B cardinality=357                                           |
> | |                                                                         |
> | 02:HASH JOIN [INNER JOIN, BROADCAST]                                      |
> | |  hash predicates: l_partkey = ps_partkey                                |
> | |  runtime filters: RF000 <- ps_partkey                                   |
> | |  row-size=36B cardinality=2.45K                                         |
> | |                                                                         |
> | |--07:EXCHANGE [BROADCAST]                                                |
> | |  |                                                                      |
> | |  01:SCAN HDFS [tpch.partsupp]                                           |
> | |     HDFS partitions=1/1 files=1 size=112.71MB                           |
> | |     predicates: ps_suppkey = 10                                         |
> | |     row-size=16B cardinality=82                                         |
> | |                                                                         |
> | 00:SCAN HDFS [tpch.lineitem]                                              |
> |    HDFS partitions=1/1 files=1 size=718.94MB                              |
> |    predicates: l_quantity IN (1, 2)                                       |
> |    runtime filters: RF000 -> l_partkey                                    |
> |    row-size=20B cardinality=235.34K                                       |
> +---------------------------------------------------------------------------+
> {noformat}
> This is likely a regression since based on some internal testing this was working correctly on an earlier version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org