Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/12/07 02:30:00 UTC

[jira] [Commented] (SPARK-41416) Rewrite self join in in predicate to aggregate

    [ https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644091#comment-17644091 ] 

Apache Spark commented on SPARK-41416:
--------------------------------------

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38951

> Rewrite self join in in predicate to aggregate
> ----------------------------------------------
>
>                 Key: SPARK-41416
>                 URL: https://issues.apache.org/jira/browse/SPARK-41416
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Wan Kun
>            Priority: Major
>
> Transform a self join that produces duplicate rows, and is used only in an IN predicate, into an aggregation.
> For an IN predicate, duplicate rows add no value; they are pure overhead.
> Example: in TPC-DS Q95, the following CTE is used only in IN predicates, and only one column (ws_order_number) is compared.
> The self join greatly inflates the number of joined rows, most of which are duplicates.
> {code:sql}
> WITH ws_wh AS
> (
>        SELECT ws1.ws_order_number,
>               ws1.ws_warehouse_sk wh1,
>               ws2.ws_warehouse_sk wh2
>        FROM   web_sales ws1,
>               web_sales ws2
>        WHERE  ws1.ws_order_number = ws2.ws_order_number
>        AND    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> {code}
> This could be optimized as follows:
> {code:sql}
> WITH ws_wh AS
>     (SELECT ws_order_number
>       FROM  web_sales
>       GROUP BY ws_order_number
>       HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
> {code}
> The optimized CTE scans the table only once and produces unique rows.
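
The equivalence claimed in the description can be checked on a toy data set. The following is a minimal sketch, not the optimizer rule from the pull request: it assumes Python's built-in sqlite3 module as a stand-in SQL engine, and the table contents are made up for illustration. It runs both CTE bodies (projected to ws_order_number, the only column the IN predicate compares) and confirms that they yield the same set of order numbers while the self join returns many more rows.

```python
import sqlite3

# Hypothetical miniature web_sales table; all rows are invented for illustration.
rows = [
    (1, 10), (1, 20), (1, 30),  # order 1: three different warehouses
    (2, 10), (2, 10),           # order 2: two rows, same warehouse
    (3, 40),                    # order 3: a single row
    (4, 10), (4, 20),           # order 4: two different warehouses
    (5, None), (5, 10),         # order 5: one NULL warehouse, one real one
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_sales (ws_order_number INTEGER, ws_warehouse_sk INTEGER)"
)
conn.executemany("INSERT INTO web_sales VALUES (?, ?)", rows)

# Original form: self join, projected to the one column used by IN.
SELF_JOIN = """
    SELECT ws1.ws_order_number
    FROM   web_sales ws1, web_sales ws2
    WHERE  ws1.ws_order_number = ws2.ws_order_number
    AND    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
"""

# Proposed form: a single scan with an aggregate.
AGGREGATE = """
    SELECT ws_order_number
    FROM   web_sales
    GROUP BY ws_order_number
    HAVING COUNT(DISTINCT ws_warehouse_sk) > 1
"""

self_join_rows = [r[0] for r in conn.execute(SELF_JOIN)]
aggregate_rows = [r[0] for r in conn.execute(AGGREGATE)]


def count_matching(subquery):
    """Count web_sales rows whose order number is IN the given subquery."""
    sql = f"SELECT COUNT(*) FROM web_sales WHERE ws_order_number IN ({subquery})"
    return conn.execute(sql).fetchone()[0]


print("self join rows:", len(self_join_rows))
print("aggregate rows:", len(aggregate_rows))
print("same order numbers:", set(self_join_rows) == set(aggregate_rows))
print("IN results agree:", count_matching(SELF_JOIN) == count_matching(AGGREGATE))
```

On this data, order 1 alone contributes 3 × 2 = 6 self-join rows, because every ordered pair of rows with different warehouses matches. Order 5 is excluded by both forms: a comparison with NULL is never true, and COUNT(DISTINCT ...) ignores NULLs.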



--
This message was sent by Atlassian Jira
(v8.20.10#820010)