Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 03:59:23 UTC
[jira] [Updated] (SPARK-22270) Renaming DF column breaks sparkPlan.outputOrdering
[ https://issues.apache.org/jira/browse/SPARK-22270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-22270:
---------------------------------
Labels: bulk-closed (was: )
> Renaming DF column breaks sparkPlan.outputOrdering
> --------------------------------------------------
>
> Key: SPARK-22270
> URL: https://issues.apache.org/jira/browse/SPARK-22270
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Yuri Bogomolov
> Priority: Major
> Labels: bulk-closed
>
> Renaming a column does not update the ordering/distribution metadata of the plan, so a later sort on the renamed column is not recognized as already satisfied. This can cause unnecessary data shuffles and significantly hurt performance.
> {code:java}
> val df = spark.sqlContext.range(0, 10)
> val sorted = df.sort("id")
> val renamed = sorted.withColumnRenamed("id", "id2")
> val sortedAgain = renamed.sort("id2")
> sortedAgain.explain(true)
> == Analyzed Logical Plan ==
> id2: bigint
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Sort [id2#6L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
>    +- *Project [id#0L AS id2#6L]
>       +- *Sort [id#0L ASC NULLS FIRST], true, 0
>          +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>             +- *Range (0, 10, step=1, splits=4)
> {code}
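> The physical plan contains two Exchange and two Sort operators, so the dataset is going to be shuffled and sorted twice, even though id2 is only an alias of the already sorted id column.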
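
To make the missing metadata update concrete, here is a minimal toy model in Scala of how a projection could translate its child's ordering through aliases. It is a sketch only, not Spark's implementation: `Attr`, `Alias`, `SortKey`, `projectOrdering` and `satisfies` are hypothetical names invented for illustration, and sort direction and null ordering are left out.

```scala
// Toy model only: every type and function here is a hypothetical stand-in,
// not one of Spark's classes.
object AliasAwareOrdering {
  // An attribute identified by name and expression id, e.g. id#0.
  final case class Attr(name: String, exprId: Long)

  // One projected column: `child AS name#exprId`.
  final case class Alias(child: Attr, name: String, exprId: Long) {
    def toAttr: Attr = Attr(name, exprId)
  }

  // One sort key; direction and null ordering are omitted for brevity.
  final case class SortKey(attr: Attr)

  // Rewrite the child's ordering in terms of the projection's output attributes.
  // Ordering is a prefix property, so translation stops at the first key
  // that the projection does not expose under an alias.
  def projectOrdering(childOrdering: Seq[SortKey], aliases: Seq[Alias]): Seq[SortKey] = {
    val renamed: Map[Attr, Attr] = aliases.map(a => a.child -> a.toAttr).toMap
    childOrdering
      .map(key => renamed.get(key.attr).map(SortKey(_)))
      .takeWhile(_.isDefined)
      .flatten
  }

  // A requested sort is redundant if it is a prefix of the provided ordering.
  def satisfies(provided: Seq[SortKey], required: Seq[SortKey]): Boolean =
    required.length <= provided.length && provided.take(required.length) == required

  def main(args: Array[String]): Unit = {
    val id = Attr("id", 0L)                      // id#0 from the plan above
    val id2 = Attr("id2", 6L)                    // id2#6 introduced by the rename
    val childOrdering = Seq(SortKey(id))         // ordering of the inner Sort
    val project = Seq(Alias(id, "id2", 6L))      // Project [id#0 AS id2#6]

    // Without translation the ordering still refers to id#0, so it does not match id2#6.
    println(satisfies(childOrdering, Seq(SortKey(id2))))
    // With translation the ordering is expressed on id2#6.
    println(satisfies(projectOrdering(childOrdering, project), Seq(SortKey(id2))))
  }
}
```

In this model the second sort on id2 becomes redundant once the ordering is translated. The same idea would presumably have to apply to the partitioning reported by the projection as well, since the plan above also repeats the Exchange.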