Posted to issues@spark.apache.org by "Takeshi Yamamuro (JIRA)" <ji...@apache.org> on 2017/10/13 05:41:00 UTC

[jira] [Commented] (SPARK-22270) Renaming DF column breaks sparkPlan.outputOrdering

    [ https://issues.apache.org/jira/browse/SPARK-22270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203081#comment-16203081 ] 

Takeshi Yamamuro commented on SPARK-22270:
------------------------------------------

This is probably a duplicate of https://issues.apache.org/jira/browse/SPARK-19981. Currently, aliases drop the output partitioning and ordering.
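
To illustrate why an alias gets in the way, here is a minimal sketch in plain Scala. It is a toy model and not Spark's implementation: Attr, SortKey, Alias, remap and satisfies are all hypothetical names made up for this example. It only shows that an ordering expressed on id#0L cannot match a requirement on id2#6L unless it is rewritten through the alias.

```scala
// Toy model only; none of these types are Spark classes.
// Attributes are identified by an expression ID, as in the plan output (id#0L, id2#6L).
final case class Attr(name: String, exprId: Int)
final case class SortKey(attr: Attr, ascending: Boolean = true)
// One projection entry `child AS alias`, e.g. id#0L AS id2#6L.
// A pass-through column would be modelled as Alias(a, a).
final case class Alias(child: Attr, alias: Attr)

object AliasAwareOrdering {
  // Rewrite the child's ordering in terms of the projection's output attributes.
  // The ordering is kept only up to the first key the projection does not expose.
  def remap(childOrdering: Seq[SortKey], projectList: Seq[Alias]): Seq[SortKey] = {
    val aliasOf: Map[Int, Attr] =
      projectList.map(a => a.child.exprId -> a.alias).toMap
    childOrdering
      .map(k => aliasOf.get(k.attr.exprId).map(out => k.copy(attr = out)))
      .takeWhile(_.isDefined)
      .flatten
  }

  // A required ordering is satisfied when it is a prefix of the available ordering.
  def satisfies(available: Seq[SortKey], required: Seq[SortKey]): Boolean =
    available.startsWith(required)

  def main(args: Array[String]): Unit = {
    val id  = Attr("id", 0)
    val id2 = Attr("id2", 6)
    val childOrdering = Seq(SortKey(id))     // what Sort [id#0L ASC] provides
    val projectList   = Seq(Alias(id, id2))  // Project [id#0L AS id2#6L]
    val required      = Seq(SortKey(id2))    // what Sort [id2#6L ASC] needs

    // The ordering still names id#0, so the requirement on id2#6 is not met.
    println(satisfies(childOrdering, required))                     // false
    // Rewritten through the alias, the same ordering meets the requirement.
    println(satisfies(remap(childOrdering, projectList), required)) // true
  }
}
```

In this model the second sort is only redundant once the ordering has been rewritten through the alias. The same reasoning would apply to the partitioning.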

> Renaming DF column breaks sparkPlan.outputOrdering
> --------------------------------------------------
>
>                 Key: SPARK-22270
>                 URL: https://issues.apache.org/jira/browse/SPARK-22270
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Yuri Bogomolov
>
> Renaming a column doesn't update the ordering/distribution metadata. This can cause unnecessary data shuffles and significantly degrade performance.
> {code:java}
> val df = spark.sqlContext.range(0, 10)
> val sorted = df.sort("id")
> val renamed = sorted.withColumnRenamed("id", "id2")
> val sortedAgain = renamed.sort("id2")
> sortedAgain.explain(true)
> == Analyzed Logical Plan ==
> id2: bigint
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Sort [id2#6L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
>    +- *Project [id#0L AS id2#6L]
>       +- *Sort [id#0L ASC NULLS FIRST], true, 0
>          +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>             +- *Range (0, 10, step=1, splits=4)
> {code}
> The physical plan contains two Exchange and two Sort operators, so the dataset is going to be shuffled and sorted twice.


