You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2018/09/08 18:39:00 UTC
[jira] [Assigned] (SPARK-20636) Eliminate unnecessary shuffle with
adjacent Window expressions
[ https://issues.apache.org/jira/browse/SPARK-20636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li reassigned SPARK-20636:
-------------------------------
Assignee: Michael Styles (was: Apache Spark)
> Eliminate unnecessary shuffle with adjacent Window expressions
> --------------------------------------------------------------
>
> Key: SPARK-20636
> URL: https://issues.apache.org/jira/browse/SPARK-20636
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 2.1.1
> Reporter: Michael Styles
> Assignee: Michael Styles
> Priority: Major
> Fix For: 3.0.0
>
>
> Consider the following example:
> {noformat}
> w1 = Window.partitionBy("sno")
> w2 = Window.partitionBy("sno", "pno")
> supply \
> .select('sno', 'pno', 'qty', F.sum('qty').over(w2).alias('sum_qty_2')) \
> .select('sno', 'pno', 'qty', F.col('sum_qty_2'), F.sum('qty').over(w1).alias('sum_qty_1')) \
> .explain()
> == Optimized Logical Plan ==
> Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1112L], [sno#980]
> +- Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1105L], [sno#980, pno#981]
> +- Relation[sno#980,pno#981,qty#982L] parquet
> == Physical Plan ==
> Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1112L], [sno#980]
> +- *Sort [sno#980 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(sno#980, 200)
> +- Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1105L], [sno#980, pno#981]
> +- *Sort [sno#980 ASC NULLS FIRST, pno#981 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(sno#980, pno#981, 200)
> +- *FileScan parquet [sno#980,pno#981,qty#982L] ...
> {noformat}
> A more efficient query plan can be achieved by flipping the Window expressions to eliminate an unnecessary shuffle as follows:
> {noformat}
> == Optimized Logical Plan ==
> Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1087L], [sno#980, pno#981]
> +- Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1085L], [sno#980]
> +- Relation[sno#980,pno#981,qty#982L] parquet
> == Physical Plan ==
> Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1087L], [sno#980, pno#981]
> +- *Sort [sno#980 ASC NULLS FIRST, pno#981 ASC NULLS FIRST], false, 0
> +- Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1085L], [sno#980]
> +- *Sort [sno#980 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(sno#980, 200)
> +- *FileScan parquet [sno#980,pno#981,qty#982L] ...
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org