Posted to user@spark.apache.org by tuan3w <ne...@gmail.com> on 2016/02/21 04:01:10 UTC

Element appears in both splits of an RDD after using randomSplit

I'm training a model using MLlib. When I try to split the data into training
and test sets, I run into a weird problem that I can't explain.

Here is the code from my experiment:

val logData = rdd.map(x => (x._1, x._2)).distinct()
val ratings: RDD[Rating] = logData.map(x => Rating(x._1, x._2, 1))
val userProducts = ratings.map(x => (x.user, x.product))
val splits = userProducts.randomSplit(Array(0.7, 0.3))
val train = splits(0)
train.count() // 1660895
val test = splits(1)
test.count() // 712306
// test whether an element appears in both splits
train.map(x => (x._1 + "_" + x._2, 1))
  .join(test.map(x => (x._1 + "_" + x._2, 2)))
  .take(5)
// returns res153: Array[(String, (Int, Int))] = Array((1172491_2899,(1,2)),
//   (1206777_1567,(1,2)), (91828_571,(1,2)), (329210_2435,(1,2)),
//   (24356_135,(1,2)))
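
A simpler way to check the overlap directly, as a quick sketch
(RDD.intersection works on these pair RDDs, so there is no need to build
string keys):

// Count the (user, product) pairs that land in both splits;
// disjoint splits would give 0 here.
train.intersection(test).count()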

If I save the RDD to HDFS and load it back, the problem doesn't happen:

userProducts.map(x => x._1 + ":" + x._2).saveAsTextFile("/user/tuannd/test2.txt")
val userProducts = sc.textFile("/user/tuannd/test2.txt").map(x => {
  val d = x.split(":")
  (d(0).toInt, d(1).toInt) // note: toInt, not toInt(), which doesn't compile
})
// the remaining steps are the same as above
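
The save/load round-trip presumably works because the data read back from
HDFS is fixed. Checkpointing should have a similar effect without the manual
text parsing; a rough sketch, assuming a hypothetical checkpoint directory:

sc.setCheckpointDir("/user/tuannd/checkpoints") // hypothetical path
val userProducts = ratings.map(x => (x.user, x.product))
userProducts.checkpoint() // must be called before any job runs on this RDD
userProducts.count()      // action that triggers writing the checkpoint
// randomSplit now reads from the checkpoint files, so both splits see
// identical data on every evaluation.
val splits = userProducts.randomSplit(Array(0.7, 0.3))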

I'm using Spark 1.5.2.
Thanks for your help.




Re: Element appears in both splits of an RDD after using randomSplit

Posted by nguyen duc tuan <ne...@gmail.com>.
That's very useful information.
The weird behavior comes from the non-determinism of the RDD before
randomSplit is applied: each split re-evaluates the parent's lineage, and a
non-deterministic parent (here, likely the element ordering after
distinct()'s shuffle) can produce different data on each evaluation, so the
splits end up overlapping.
By caching the RDD, we make it deterministic, and the problem is solved.
Thank you for your help.
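
For anyone who hits the same problem, a minimal sketch of that fix, reusing
the ratings RDD from the original post:

import org.apache.spark.storage.StorageLevel

// Persist the parent so every evaluation of the lineage sees the same
// contents, then split.
val userProducts = ratings.map(x => (x.user, x.product))
  .persist(StorageLevel.MEMORY_AND_DISK)
val Array(train, test) = userProducts.randomSplit(Array(0.7, 0.3))
// Sanity check: with a deterministic parent the splits are disjoint.
assert(train.intersection(test).isEmpty())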

2016-02-21 11:12 GMT+07:00 Ted Yu <yu...@gmail.com>:

> Have you looked at:
> SPARK-12662 Fix DataFrame.randomSplit to avoid creating overlapping splits
>
> Cheers
>
> On Sat, Feb 20, 2016 at 7:01 PM, tuan3w <ne...@gmail.com> wrote:
>
>> [quoted original message snipped]
>

Re: Element appears in both splits of an RDD after using randomSplit

Posted by Ted Yu <yu...@gmail.com>.
Have you looked at:
SPARK-12662 Fix DataFrame.randomSplit to avoid creating overlapping splits
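
If I remember right, that fix makes DataFrame.randomSplit deterministic by
sorting before sampling. For an RDD, a rough sketch of the same idea (caching
before the split, as mentioned above, works as well):

// Impose a deterministic order so every re-evaluation of the lineage
// yields the same sequence, then split.
val stable = userProducts.sortBy(identity)
val splits = stable.randomSplit(Array(0.7, 0.3))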

Cheers

On Sat, Feb 20, 2016 at 7:01 PM, tuan3w <ne...@gmail.com> wrote:

> [quoted original message snipped]