You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2016/01/06 17:27:39 UTC

[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit

    [ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085764#comment-15085764 ] 

Yin Huai commented on SPARK-12662:
----------------------------------

btw, with local sort operator, we can make row ordering in a partition deterministic. However, if the partition that a row belongs to is not deterministic (e.g. the DF shuffles data randomly), the returned DFs of randomSplit still have this issue. Let's also add doc to the randomSplit to let users know that random shuffle should not be used for the input DF of randomSplit.

> Add a local sort operator to DataFrame used by randomSplit
> ----------------------------------------------------------
>
>                 Key: SPARK-12662
>                 URL: https://issues.apache.org/jira/browse/SPARK-12662
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, SQL
>            Reporter: Yin Huai
>            Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following code will provide overlapped rows for two DFs returned by the randomSplit. 
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID : Int)
> sqlContext.createDataFrame(x.map {R(_)}).write.format("json").saveAsTable("bugsc1597")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")} does not guarantee the ordering rows in a partition. It will be good to add a local sort operator to make row ordering within a partition deterministic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org