
Posted to issues@spark.apache.org by "Gaurav Kumar (JIRA)" <ji...@apache.org> on 2015/12/31 07:09:49 UTC

[jira] [Created] (SPARK-12590) Inconsistent behavior of randomSplit in YARN mode

Gaurav Kumar created SPARK-12590:
------------------------------------

             Summary: Inconsistent behavior of randomSplit in YARN mode
                 Key: SPARK-12590
                 URL: https://issues.apache.org/jira/browse/SPARK-12590
             Project: Spark
          Issue Type: Bug
          Components: MLlib, Spark Core
    Affects Versions: 1.5.2
         Environment: YARN mode
            Reporter: Gaurav Kumar


I noticed inconsistent behavior from rdd.randomSplit when the source RDD has been repartitioned, but only in YARN mode. In local mode it works fine.

*Code:*
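val rdd = sc.parallelize(1 to 1000000)
val rdd2 = rdd.repartition(64)
rdd.partitions.size   // partition count before repartitioning
rdd2.partitions.size  // 64 after repartition
// Split roughly 70/30 with a fixed seed of 1
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.takeOrdered(10) // 10 smallest elements of each split
test.takeOrdered(10)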

*Master: local*
Both takeOrdered calls return the same results on every run, and train and test have no numbers in common.

*Master: YARN*
However, when the same statements are run in YARN mode, they return different results on every run, and train and test contain overlapping numbers.
If I call randomSplit on the original rdd (without the repartition) instead, it works fine even on YARN.

This suggests that the repartition is re-evaluated each time one of the splits is computed.
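
To illustrate why re-evaluation could produce overlapping splits, here is a minimal plain-Scala sketch that does not use Spark at all. It assumes, without confirmation from this report, that each split makes one pseudo-random draw per element in iteration order with the same seed, and that a re-evaluated repartition may return the elements in a different order. The names SplitOverlapSketch and sample are hypothetical and are not Spark APIs.

```scala
import scala.util.Random

object SplitOverlapSketch {
  // Hypothetical stand-in for sampling one partition: one pseudo-random draw
  // per element, in iteration order; keep elements whose draw is in [lo, hi).
  def sample(partition: Seq[Int], seed: Long, lo: Double, hi: Double): Seq[Int] = {
    val rng = new Random(seed)
    partition.filter { _ =>
      val x = rng.nextDouble()
      x >= lo && x < hi
    }
  }

  val data: Vector[Int] = (1 to 1000).toVector

  // Both splits see the elements in the same order.
  val train: Seq[Int] = sample(data, 1L, 0.0, 0.7)
  val stableTest: Seq[Int] = sample(data, 1L, 0.7, 1.0)

  // The second split sees a different order (assumption: this imitates
  // a repartition that is evaluated again for the second split).
  val reordered: Vector[Int] = new Random(42L).shuffle(data)
  val unstableTest: Seq[Int] = sample(reordered, 1L, 0.7, 1.0)

  def main(args: Array[String]): Unit = {
    println(s"same order: overlap = ${train.intersect(stableTest).size}")
    println(s"different order: overlap = ${train.intersect(unstableTest).size}")
  }
}
```

When both splits see the same order, every element gets the same draw in both passes, so the two ranges select disjoint sets that together cover the input. When the order changes between passes, an element can fall into both ranges or into neither.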

Interestingly, if I cache rdd2 before splitting it, the behavior is consistent, since the repartition is then not evaluated again for each split.
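
The caching approach, together with an explicit overlap check, might look like the sketch below. The object name CachedSplitSketch and the helper run are hypothetical, and the master is assumed to be supplied by spark-submit. Note that cache() is best effort: if cached partitions are evicted they are recomputed, so this is a workaround under that assumption rather than a guarantee.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedSplitSketch {
  // Returns (number of elements present in both splits, train size + test size).
  def run(sc: SparkContext, n: Int): (Long, Long) = {
    val rdd2 = sc.parallelize(1 to n).repartition(64)
    rdd2.cache()  // keep the repartitioned data so both splits read the same partitions
    rdd2.count()  // run one action so the cache is populated before splitting

    val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), seed = 1L)

    // The input values are distinct, so any common element means the splits overlap.
    val overlap = train.intersection(test).count()
    (overlap, train.count() + test.count())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the master comes from spark-submit, for example --master yarn.
    val sc = new SparkContext(new SparkConf().setAppName("CachedSplitSketch"))
    try {
      val (overlap, total) = run(sc, 1000000)
      println(s"overlap = $overlap, train + test = $total")
    } finally {
      sc.stop()
    }
  }
}
```

A consistent split should report an overlap of 0 and a combined size equal to the input size.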


