
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:13:25 UTC

[jira] [Resolved] (SPARK-22270) Renaming DF column breaks sparkPlan.outputOrdering

     [ https://issues.apache.org/jira/browse/SPARK-22270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-22270.
----------------------------------
    Resolution: Incomplete

> Renaming DF column breaks sparkPlan.outputOrdering
> --------------------------------------------------
>
>                 Key: SPARK-22270
>                 URL: https://issues.apache.org/jira/browse/SPARK-22270
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Yuri Bogomolov
>            Priority: Major
>              Labels: bulk-closed
>
> Renaming a column does not update the ordering/distribution metadata of the plan. This can cause unnecessary shuffles and sorts, and can significantly hurt performance.
> {code:java}
> val df = spark.sqlContext.range(0, 10)
> val sorted = df.sort("id")
> val renamed = sorted.withColumnRenamed("id", "id2")
> val sortedAgain = renamed.sort("id2")
> sortedAgain.explain(true)
> // Output:
> == Analyzed Logical Plan ==
> id2: bigint
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Sort [id2#6L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
>    +- *Project [id#0L AS id2#6L]
>       +- *Sort [id#0L ASC NULLS FIRST], true, 0
>          +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>             +- *Range (0, 10, step=1, splits=4)
> {code}
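> The physical plan contains two Sort operators and two Exchange (shuffle) steps, so the dataset is shuffled and sorted twice even though the rename does not change the order of the rows.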
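
One way to look at the metadata in question directly, rather than reading it off the explain output, is to print sparkPlan.outputOrdering before and after the rename. The following is a minimal sketch, not part of the original report. The object name OrderingCheck, the helper orderingOf, and the local[2] master are assumptions made for illustration. What gets printed depends on the Spark version, so no expected output is shown.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.catalyst.expressions.SortOrder

// Hypothetical helper object, for illustration only.
object OrderingCheck {
  // Ordering that the planned (pre-execution) physical plan reports for a dataset.
  def orderingOf(ds: Dataset[_]): Seq[SortOrder] =
    ds.queryExecution.sparkPlan.outputOrdering

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to reproduce the planning behaviour.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-22270-check")
      .getOrCreate()

    val sorted  = spark.range(0, 10).sort("id")
    val renamed = sorted.withColumnRenamed("id", "id2")

    // If the rename drops the ordering metadata, the second line will not
    // mention id2 and a later sort("id2") has to shuffle and sort again.
    println(s"sorted:  ${orderingOf(sorted)}")
    println(s"renamed: ${orderingOf(renamed)}")

    spark.stop()
  }
}
```

Comparing the two printed orderings on the affected versions and on a newer release should show whether the alias is taken into account by the planner in that release.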
