Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/12/07 02:30:00 UTC
[jira] [Commented] (SPARK-41416) Rewrite self join in in predicate to aggregate
[ https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644091#comment-17644091 ]
Apache Spark commented on SPARK-41416:
--------------------------------------
User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38951
> Rewrite self join in in predicate to aggregate
> ----------------------------------------------
>
> Key: SPARK-41416
> URL: https://issues.apache.org/jira/browse/SPARK-41416
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Wan Kun
> Priority: Major
>
> Transform a self join that produces duplicate rows and is used only in an IN predicate into an aggregation.
> For an IN predicate, duplicate rows add no value; they are pure overhead.
> For example, in TPC-DS Q95 the following CTE is used only in IN predicates, and only one column ({@code ws_order_number}) is compared.
> The self join causes a quadratic blow-up in the number of joined rows, most of which are duplicates.
> {code:java}
> WITH ws_wh AS (
>   SELECT ws1.ws_order_number,
>          ws1.ws_warehouse_sk wh1,
>          ws2.ws_warehouse_sk wh2
>   FROM web_sales ws1,
>        web_sales ws2
>   WHERE ws1.ws_order_number = ws2.ws_order_number
>     AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> {code}
> It could be optimized as follows:
> {code:java}
> WITH ws_wh AS (
>   SELECT ws_order_number
>   FROM web_sales
>   GROUP BY ws_order_number
>   HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
> {code}
> The optimized CTE scans the table only once and produces only unique rows.
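As a quick illustration (not part of the ticket), the equivalence of the two CTEs for IN-predicate purposes can be checked on toy data with SQLite. The table and column names follow the ticket; the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE web_sales (ws_order_number INTEGER, ws_warehouse_sk INTEGER)")
# Order 1 ships from three warehouses, order 2 from one warehouse twice,
# order 3 from a single warehouse once.
cur.executemany(
    "INSERT INTO web_sales VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 20), (2, 20), (3, 30)],
)

# Original formulation: self join, which multiplies matching rows.
join_rows = cur.execute("""
    SELECT ws1.ws_order_number
    FROM web_sales ws1, web_sales ws2
    WHERE ws1.ws_order_number = ws2.ws_order_number
      AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk
""").fetchall()

# Rewritten formulation: single scan plus aggregation, one row per order.
agg_rows = cur.execute("""
    SELECT ws_order_number
    FROM web_sales
    GROUP BY ws_order_number
    HAVING COUNT(DISTINCT ws_warehouse_sk) > 1
""").fetchall()

# An IN predicate only cares about the set of values, so the duplicates
# produced by the self join are wasted work.
join_orders = {r[0] for r in join_rows}
agg_orders = {r[0] for r in agg_rows}
print(len(join_rows), join_orders, agg_orders)  # 6 {1} {1}
```

On this data the self join emits 6 rows (the 3 warehouses of order 1 paired every way) while the aggregate emits exactly one, yet both denote the same set for an IN predicate.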
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org