You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2023/05/26 01:19:00 UTC
[jira] [Assigned] (IMPALA-12168) Aggregate's cardinality overestimated for a count distinct query with grouping on same column

     [ https://issues.apache.org/jira/browse/IMPALA-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha reassigned IMPALA-12168:
-----------------------------------

    Assignee: Aman Sinha

> Aggregate's cardinality overestimated for a count distinct query with grouping on same column
> ---------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-12168
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12168
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>
> When the COUNT(DISTINCT col) is on the same column as the GROUP BY, we currently overestimate the cardinality: 
> {noformat}
> [localhost:21050] tpch> explain select l_shipdate, l_quantity, count(distinct l_shipdate) from lineitem group by l_shipdate, l_quantity;
> Query: explain select l_shipdate, l_quantity, count(distinct l_shipdate) from lineitem group by l_shipdate, l_quantity
> +-------------------------------------------------------------+
> | Explain String                                              |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=89.75MB Threads=5 |
> | Per-Host Resource Estimates: Memory=517MB                   |
> |                                                             |
> | PLAN-ROOT SINK                                              |
> | |                                                           |
> | 07:EXCHANGE [UNPARTITIONED]                                 |
> | |                                                           |
> | 06:AGGREGATE [FINALIZE]                                     |
> | |  output: count:merge(l_shipdate)                          |
> | |  group by: l_shipdate, l_quantity                         |
> | |  row-size=38B cardinality=134.08K                         |
> | |                                                           |
> | 05:EXCHANGE [HASH(l_shipdate,l_quantity)]                   |
> | |                                                           |
> | 02:AGGREGATE [STREAMING]                                    |
> | |  output: count(l_shipdate)                                |
> | |  group by: l_shipdate, l_quantity                         |
> | |  row-size=38B cardinality=134.08K                         |
> | |                                                           |
> | 04:AGGREGATE                                                |
> | |  group by: l_shipdate, l_quantity, l_shipdate             |
> | |  row-size=52B cardinality=6.00M                           |
> | |                                                           |
> | 03:EXCHANGE [HASH(l_shipdate,l_quantity,l_shipdate)]        |
> | |                                                           |
> | 01:AGGREGATE [STREAMING]                                    |
> | |  group by: l_shipdate, l_quantity, l_shipdate             |
> | |  row-size=52B cardinality=6.00M                           |
> | |                                                           |
> | 00:SCAN HDFS [tpch.lineitem]                                |
> |    HDFS partitions=1/1 files=1 size=718.94MB                |
> |    row-size=30B cardinality=6.00M                           |
> +-------------------------------------------------------------+
> {noformat}
> Here the lower Streaming Agg's cardinality is 6M rows whereas the actual rowcount for that phase is 376K:
> {noformat}
> 01:AGGREGATE               3      3  327.441ms  350.542ms  376.80K       6.00M   36.11 MB       91.57 MB  STREAMING  
> {noformat}
> The NDV column stats for this table:
> {noformat}
> [localhost:21050] tpch> show column stats lineitem;
> Query: show column stats lineitem
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> | Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg Size      | #Trues | #Falses |
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> | l_orderkey      | BIGINT        | 1563438          | 0      | 8        | 8.0           | -1     | -1      |
> | l_partkey       | BIGINT        | 200516           | 0      | 8        | 8.0           | -1     | -1      |
> | l_suppkey       | BIGINT        | 9712             | 0      | 8        | 8.0           | -1     | -1      |
> | l_linenumber    | INT           | 7                | 0      | 4        | 4.0           | -1     | -1      |
> | l_quantity      | DECIMAL(12,2) | 51               | 0      | 8        | 8.0           | -1     | -1      |
> | l_extendedprice | DECIMAL(12,2) | 868550           | 0      | 8        | 8.0           | -1     | -1      |
> | l_discount      | DECIMAL(12,2) | 11               | 0      | 8        | 8.0           | -1     | -1      |
> | l_tax           | DECIMAL(12,2) | 9                | 0      | 8        | 8.0           | -1     | -1      |
> | l_returnflag    | STRING        | 3                | 0      | 1        | 1.0           | -1     | -1      |
> | l_linestatus    | STRING        | 2                | 0      | 1        | 1.0           | -1     | -1      |
> | l_shipdate      | STRING        | 2629             | 0      | 10       | 10.0          | -1     | -1      |
> | l_commitdate    | STRING        | 2559             | 0      | 10       | 10.0          | -1     | -1      |
> | l_receiptdate   | STRING        | 2658             | 0      | 10       | 10.0          | -1     | -1      |
> | l_shipinstruct  | STRING        | 4                | 0      | 17       | 11.9986381531 | -1     | -1      |
> | l_shipmode      | STRING        | 7                | 0      | 7        | 4.28530454636 | -1     | -1      |
> | l_comment       | STRING        | 4652621          | 0      | 43       | 26.4941692352 | -1     | -1      |
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> {noformat}
> The reason for this overestimation is the group by exprs contain duplicate column l_shipdate: 
> {noformat}
> group by: l_shipdate, l_quantity, l_shipdate       
> {noformat}
> Due to this extra duplicate column, we compute the cardinality as 
> NDV(l_shipdate) * NDV(l_quantity)  * NDV(l_shipdate) = 352M  which is then capped at 6M since the child produces 6M rows. 
> The right estimate should be NDV(l_shipdate) *  NDV(l_quantity) = 134K



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org