Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2017/01/06 01:17:58 UTC

[jira] [Commented] (SPARK-19091) createDataset(sc.parallelize(x: Seq)) should be equivalent to createDataset(x: Seq)

    [ https://issues.apache.org/jira/browse/SPARK-19091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803105#comment-15803105 ] 

Josh Rosen commented on SPARK-19091:
------------------------------------

This is a pretty easy change, but it does slightly affect the case where a user relies on the degree of parallelism set by sc.parallelize(). So this may not be as obvious an optimization as it first appears. I'll just leave this JIRA here as documentation of the performance difference so users can pick the appropriate method for their use case.
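
A minimal sketch of the parallelism concern, assuming a local SparkSession. The object name, the helper partitionCounts, the master setting, and the numSlices value are all hypothetical and chosen only for illustration. It prints the partition counts so they can be compared on whatever Spark version is in use. No expected output is shown because the results depend on the version and on the default parallelism.

```scala
import org.apache.spark.sql.SparkSession

object ParallelismSketch {
  // Hypothetical helper: returns the partition counts of
  // (the RDD itself, the Dataset built from the RDD, the Dataset built from the Seq).
  def partitionCounts(spark: SparkSession, x: Seq[Int], numSlices: Int): (Int, Int, Int) = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(x, numSlices)
    val fromRdd = spark.createDataset(rdd)
    val fromSeq = spark.createDataset(x)
    (rdd.getNumPartitions, fromRdd.rdd.getNumPartitions, fromSeq.rdd.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session with 4 threads, for illustration only.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("SPARK-19091-parallelism")
      .getOrCreate()

    // numSlices = 8 stands in for a degree of parallelism a user picked deliberately.
    val (rddParts, fromRddParts, fromSeqParts) = partitionCounts(spark, Seq(1, 2, 3), 8)
    println(s"RDD partitions:              $rddParts")
    println(s"Dataset from RDD partitions: $fromRddParts")
    println(s"Dataset from Seq partitions: $fromSeqParts")

    spark.stop()
  }
}
```

If the special case were added, a user who passed an explicit numSlices could see a different partition count after the change, which is the behavior difference described above.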

> createDataset(sc.parallelize(x: Seq)) should be equivalent to createDataset(x: Seq)
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19091
>                 URL: https://issues.apache.org/jira/browse/SPARK-19091
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>
> It turns out that spark.createDataset(sc.parallelize(x: Seq)) and spark.createDataset(x: Seq) produce different plans, where the former is much less efficient due to a lack of accurate size estimation. We should modify SparkSession to special-case the situation where createDataset is called on a ParallelCollectionRDD in order to remove this source of performance variation between the two plans.
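
One way to see the difference described in the issue is to print both plans side by side. This is a rough sketch, assuming a local SparkSession. The object name, the helper optimizedPlans, and the sample data are hypothetical. The exact operator names in the output vary across Spark versions, so none are shown here.

```scala
import org.apache.spark.sql.SparkSession

object PlanComparisonSketch {
  // Hypothetical helper: returns the optimized logical plans, rendered as strings,
  // for (the Dataset built from an RDD, the Dataset built directly from the Seq).
  def optimizedPlans(spark: SparkSession, x: Seq[Int]): (String, String) = {
    import spark.implicits._
    val fromRdd = spark.createDataset(spark.sparkContext.parallelize(x))
    val fromSeq = spark.createDataset(x)
    (fromRdd.queryExecution.optimizedPlan.toString,
     fromSeq.queryExecution.optimizedPlan.toString)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-19091-plans")
      .getOrCreate()

    val (rddPlan, seqPlan) = optimizedPlans(spark, Seq(1, 2, 3))
    println("Plan for createDataset(sc.parallelize(x)):")
    println(rddPlan)
    println("Plan for createDataset(x):")
    println(seqPlan)

    spark.stop()
  }
}
```

Calling explain(true) on each Dataset prints the same information along with the physical plan, which is where a difference in join strategy would show up once these Datasets are joined against other data.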



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)