You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2023/05/26 01:19:00 UTC
[jira] [Assigned] (IMPALA-12168) Aggregate's cardinality overestimated for a count distinct query with grouping on same column
[ https://issues.apache.org/jira/browse/IMPALA-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Sinha reassigned IMPALA-12168:
-----------------------------------
Assignee: Aman Sinha
> Aggregate's cardinality overestimated for a count distinct query with grouping on same column
> ---------------------------------------------------------------------------------------------
>
> Key: IMPALA-12168
> URL: https://issues.apache.org/jira/browse/IMPALA-12168
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
>
> When the COUNT(DISTINCT col) is on the same column as the GROUP BY, we currently overestimate the cardinality:
> {noformat}
> [localhost:21050] tpch> explain select l_shipdate, l_quantity, count(distinct l_shipdate) from lineitem group by l_shipdate, l_quantity;
> Query: explain select l_shipdate, l_quantity, count(distinct l_shipdate) from lineitem group by l_shipdate, l_quantity
> +-------------------------------------------------------------+
> | Explain String |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=89.75MB Threads=5 |
> | Per-Host Resource Estimates: Memory=517MB |
> | |
> | PLAN-ROOT SINK |
> | | |
> | 07:EXCHANGE [UNPARTITIONED] |
> | | |
> | 06:AGGREGATE [FINALIZE] |
> | | output: count:merge(l_shipdate) |
> | | group by: l_shipdate, l_quantity |
> | | row-size=38B cardinality=134.08K |
> | | |
> | 05:EXCHANGE [HASH(l_shipdate,l_quantity)] |
> | | |
> | 02:AGGREGATE [STREAMING] |
> | | output: count(l_shipdate) |
> | | group by: l_shipdate, l_quantity |
> | | row-size=38B cardinality=134.08K |
> | | |
> | 04:AGGREGATE |
> | | group by: l_shipdate, l_quantity, l_shipdate |
> | | row-size=52B cardinality=6.00M |
> | | |
> | 03:EXCHANGE [HASH(l_shipdate,l_quantity,l_shipdate)] |
> | | |
> | 01:AGGREGATE [STREAMING] |
> | | group by: l_shipdate, l_quantity, l_shipdate |
> | | row-size=52B cardinality=6.00M |
> | | |
> | 00:SCAN HDFS [tpch.lineitem] |
> | HDFS partitions=1/1 files=1 size=718.94MB |
> | row-size=30B cardinality=6.00M |
> +-------------------------------------------------------------+
> {noformat}
> Here the lower Streaming Agg's cardinality is 6M rows whereas the actual rowcount for that phase is 376K:
> {noformat}
> 01:AGGREGATE 3 3 327.441ms 350.542ms 376.80K 6.00M 36.11 MB 91.57 MB STREAMING
> {noformat}
> The NDV column stats for this table:
> {noformat}
> [localhost:21050] tpch> show column stats lineitem;
> Query: show column stats lineitem
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | #Trues | #Falses |
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> | l_orderkey | BIGINT | 1563438 | 0 | 8 | 8.0 | -1 | -1 |
> | l_partkey | BIGINT | 200516 | 0 | 8 | 8.0 | -1 | -1 |
> | l_suppkey | BIGINT | 9712 | 0 | 8 | 8.0 | -1 | -1 |
> | l_linenumber | INT | 7 | 0 | 4 | 4.0 | -1 | -1 |
> | l_quantity | DECIMAL(12,2) | 51 | 0 | 8 | 8.0 | -1 | -1 |
> | l_extendedprice | DECIMAL(12,2) | 868550 | 0 | 8 | 8.0 | -1 | -1 |
> | l_discount | DECIMAL(12,2) | 11 | 0 | 8 | 8.0 | -1 | -1 |
> | l_tax | DECIMAL(12,2) | 9 | 0 | 8 | 8.0 | -1 | -1 |
> | l_returnflag | STRING | 3 | 0 | 1 | 1.0 | -1 | -1 |
> | l_linestatus | STRING | 2 | 0 | 1 | 1.0 | -1 | -1 |
> | l_shipdate | STRING | 2629 | 0 | 10 | 10.0 | -1 | -1 |
> | l_commitdate | STRING | 2559 | 0 | 10 | 10.0 | -1 | -1 |
> | l_receiptdate | STRING | 2658 | 0 | 10 | 10.0 | -1 | -1 |
> | l_shipinstruct | STRING | 4 | 0 | 17 | 11.9986381531 | -1 | -1 |
> | l_shipmode | STRING | 7 | 0 | 7 | 4.28530454636 | -1 | -1 |
> | l_comment | STRING | 4652621 | 0 | 43 | 26.4941692352 | -1 | -1 |
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> {noformat}
> The reason for this overestimation is the group by exprs contain duplicate column l_shipdate:
> {noformat}
> group by: l_shipdate, l_quantity, l_shipdate
> {noformat}
> Due to this extra duplicate column, we compute the cardinality as
> NDV(l_shipdate) * NDV(l_quantity) * NDV(l_shipdate) = 352M which is then capped at 6M since the child produces 6M rows.
> The right estimate should be NDV(l_shipdate) * NDV(l_quantity) = 134K
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org