Posted to issues@spark.apache.org by "Wan Kun (Jira)" <ji...@apache.org> on 2022/12/07 02:18:00 UTC

[jira] [Created] (SPARK-41416) Rewrite self join in IN predicate to aggregate

Wan Kun created SPARK-41416:
-------------------------------

             Summary: Rewrite self join in IN predicate to aggregate
                 Key: SPARK-41416
                 URL: https://issues.apache.org/jira/browse/SPARK-41416
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Wan Kun


Transform a self join that produces duplicate rows, and whose result is used only in an IN predicate, into an aggregation.
For an IN predicate, duplicate rows add no value; they are pure overhead.

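Example: in TPC-DS Q95, the following CTE is used only in IN predicates, and only one of its columns ({{ws_order_number}}) is compared.
The self join sharply inflates the number of joined rows: an order whose rows span n distinct warehouses yields n*(n-1) joined rows, so growth is quadratic per order number, and almost all of those rows are duplicates as far as IN is concerned.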


{code:sql}
WITH ws_wh AS
(
       SELECT ws1.ws_order_number,
              ws1.ws_warehouse_sk wh1,
              ws2.ws_warehouse_sk wh2
       FROM   web_sales ws1,
              web_sales ws2
       WHERE  ws1.ws_order_number = ws2.ws_order_number
       AND    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
{code}
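
To give a sense of how many rows the self join emits, here is a minimal sketch that runs the same join through Python's built-in sqlite3 module. SQLite is only a stand-in for Spark SQL here, and the helper name joined_row_count and the table contents are hypothetical, made up for illustration.

```python
import sqlite3


def joined_row_count(n_warehouses):
    """Rows the self join emits for one order spread over n distinct warehouses."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_sales (ws_order_number INTEGER, ws_warehouse_sk INTEGER)"
    )
    # Hypothetical data: a single order (number 1) with one row per warehouse.
    conn.executemany(
        "INSERT INTO web_sales VALUES (1, ?)",
        [(w,) for w in range(n_warehouses)],
    )
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM   web_sales ws1,
               web_sales ws2
        WHERE  ws1.ws_order_number = ws2.ws_order_number
        AND    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
        """
    ).fetchone()
    conn.close()
    return count


counts = {n: joined_row_count(n) for n in (2, 4, 8)}
print(counts)  # {2: 2, 4: 12, 8: 56}
```

Each order contributes n*(n-1) joined rows, yet an IN predicate only needs to know that the order number appears at least once.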



This could be optimized as follows:


{code:sql}
WITH ws_wh AS
    (SELECT ws_order_number
      FROM  web_sales
      GROUP BY ws_order_number
      HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
{code}


The optimized CTE scans the table only once and returns one row per ws_order_number.
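
As a sanity check on the rewrite, the sketch below runs both forms of the CTE under the same IN predicate and compares the results. It assumes Python's built-in sqlite3 as a stand-in for Spark SQL. The outer query, the helper orders_matching, and the sample rows are hypothetical and not taken from Q95 itself.

```python
import sqlite3

SELF_JOIN_CTE = """
WITH ws_wh AS (
    SELECT ws1.ws_order_number,
           ws1.ws_warehouse_sk wh1,
           ws2.ws_warehouse_sk wh2
    FROM   web_sales ws1,
           web_sales ws2
    WHERE  ws1.ws_order_number = ws2.ws_order_number
    AND    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
"""

AGGREGATE_CTE = """
WITH ws_wh AS (
    SELECT ws_order_number
    FROM   web_sales
    GROUP  BY ws_order_number
    HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
"""

# Hypothetical outer query: it touches the CTE only through an IN predicate.
OUTER_QUERY = """
SELECT DISTINCT ws_order_number
FROM   web_sales
WHERE  ws_order_number IN (SELECT ws_order_number FROM ws_wh)
ORDER  BY ws_order_number
"""


def orders_matching(conn, cte):
    """Run the outer query on top of the given CTE definition."""
    return [row[0] for row in conn.execute(cte + OUTER_QUERY)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_sales (ws_order_number INTEGER, ws_warehouse_sk INTEGER)"
)
conn.executemany(
    "INSERT INTO web_sales VALUES (?, ?)",
    [
        (1, 10), (1, 20), (1, 30),  # three distinct warehouses: matches
        (2, 10), (2, 10),           # same warehouse twice: no match
        (3, 10), (3, None),         # NULL warehouse never counts as different
        (4, 10), (4, 20),           # two distinct warehouses: matches
    ],
)

via_join = orders_matching(conn, SELF_JOIN_CTE)
via_aggregate = orders_matching(conn, AGGREGATE_CTE)
print(via_join, via_aggregate)  # [1, 4] [1, 4]
```

Both forms select the same order numbers on this data, including the order with a NULL warehouse, because <> never evaluates to true against NULL and COUNT(DISTINCT ...) ignores NULLs.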


