You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/09/27 09:58:00 UTC
[jira] [Updated] (IMPALA-10681) JOIN cardinality is wrong for INNER
joins when combined with aggregations
[ https://issues.apache.org/jira/browse/IMPALA-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang updated IMPALA-10681:
------------------------------------
Fix Version/s: Impala 4.0.1
> JOIN cardinality is wrong for INNER joins when combined with aggregations
> -------------------------------------------------------------------------
>
> Key: IMPALA-10681
> URL: https://issues.apache.org/jira/browse/IMPALA-10681
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Zoltán Borók-Nagy
> Assignee: Aman Sinha
> Priority: Major
> Fix For: Impala 4.1.0, Impala 4.0.1
>
>
> JOIN cardinality estimate can be off for INNER joins. Consider the following LEFT SEMI JOIN which estimates the cardinalities well:
> {noformat}
> [localhost:21050] tpcds_parquet> explain select * from store_sales left semi join (select max(s_store_sk) as max_store_sk from store) v on ss_store_sk = max_store_sk;
> Query: explain select * from store_sales left semi join (select max(s_store_sk) as max_store_sk from store) v on ss_store_sk = max_store_sk
> +-------------------------------------------------------------+
> | Explain String |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=14.95MB Threads=6 |
> | Per-Host Resource Estimates: Memory=139MB |
> | |
> | PLAN-ROOT SINK |
> | | |
> | 07:EXCHANGE [UNPARTITIONED] |
> | | |
> | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST] |
> | | hash predicates: ss_store_sk = max(s_store_sk) |
> | | runtime filters: RF000 <- max(s_store_sk) |
> | | row-size=100B cardinality=480.07K |
> | | |
> | |--06:EXCHANGE [BROADCAST] |
> | | | |
> | | 05:AGGREGATE [FINALIZE] |
> | | | output: max:merge(s_store_sk) |
> | | | row-size=4B cardinality=1 |
> | | | |
> | | 04:EXCHANGE [UNPARTITIONED] |
> | | | |
> | | 02:AGGREGATE |
> | | | output: max(s_store_sk) |
> | | | row-size=4B cardinality=1 |
> | | | |
> | | 01:SCAN HDFS [tpcds_parquet.store] |
> | | HDFS partitions=1/1 files=1 size=9.93KB |
> | | row-size=4B cardinality=12 |
> | | |
> | 00:SCAN HDFS [tpcds_parquet.store_sales] |
> | HDFS partitions=1824/1824 files=1824 size=200.93MB |
> | runtime filters: RF000 -> ss_store_sk |
> | row-size=100B cardinality=2.88M |
> +-------------------------------------------------------------+
> {noformat}
> JOIN cardinality is 1/6 of LHS scan node cardinality which seems reasonable, since LHS NDV is 6, and the right side only has one row.
> Now let's switch to an INNER join:
> {noformat}
> [localhost:21050] tpcds_parquet> explain select * from store_sales inner join (select max(s_store_sk) as max_store_sk from store) v on ss_store_sk = max_store_sk;
> Query: explain select * from store_sales inner join (select max(s_store_sk) as max_store_sk from store) v on ss_store_sk = max_store_sk
> +-------------------------------------------------------------+
> | Explain String |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=14.95MB Threads=6 |
> | Per-Host Resource Estimates: Memory=193MB |
> | |
> | PLAN-ROOT SINK |
> | | |
> | 07:EXCHANGE [UNPARTITIONED] |
> | | |
> | 03:HASH JOIN [INNER JOIN, BROADCAST] |
> | | hash predicates: ss_store_sk = max(s_store_sk) |
> | | runtime filters: RF000 <- max(s_store_sk) |
> | | row-size=104B cardinality=2.88M |
> | | |
> | |--06:EXCHANGE [BROADCAST] |
> | | | |
> | | 05:AGGREGATE [FINALIZE] |
> | | | output: max:merge(s_store_sk) |
> | | | row-size=4B cardinality=1 |
> | | | |
> | | 04:EXCHANGE [UNPARTITIONED] |
> | | | |
> | | 02:AGGREGATE |
> | | | output: max(s_store_sk) |
> | | | row-size=4B cardinality=1 |
> | | | |
> | | 01:SCAN HDFS [tpcds_parquet.store] |
> | | HDFS partitions=1/1 files=1 size=9.93KB |
> | | row-size=4B cardinality=12 |
> | | |
> | 00:SCAN HDFS [tpcds_parquet.store_sales] |
> | HDFS partitions=1824/1824 files=1824 size=200.93MB |
> | runtime filters: RF000 -> ss_store_sk |
> | row-size=100B cardinality=2.88M |
> +-------------------------------------------------------------+
> {noformat}
> The JOIN cardinality equals to the lhs cardinality even when the rhs cardinality is only one.
> SEMI JOIN cardinality is calculated differently than INNER join cardinality.
> SEMI JOIN cardinality:
> https://github.com/apache/impala/blob/c65d7861d9ae28f6fc592727ff699a8155dcda2c/fe/src/main/java/org/apache/impala/planner/JoinNode.java#L486-L562
> INNER JOIN cardinality:
> https://github.com/apache/impala/blob/c65d7861d9ae28f6fc592727ff699a8155dcda2c/fe/src/main/java/org/apache/impala/planner/JoinNode.java#L242-L308
> The problem is that the latter doesn't find the equi join conjunct "ss_store_sk = max(s_store_sk)" eligible, so it returns lhs cardinality:
> https://github.com/apache/impala/blob/c65d7861d9ae28f6fc592727ff699a8155dcda2c/fe/src/main/java/org/apache/impala/planner/JoinNode.java#L296-L300
> ss_store_sk = max(s_store_sk) is not eligible because Expr.findSrcScanSlot() returns NULL for "max(s_store_sk)."
> https://github.com/apache/impala/blob/c65d7861d9ae28f6fc592727ff699a8155dcda2c/fe/src/main/java/org/apache/impala/planner/JoinNode.java#L449
> I think the solution should be to either change Expr.findSrcScanSlot() to return the scan slot. Or, change getJoinCardinality() to return an estimation similar to the SEMI JOIN. Or fix both.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org