Posted to user@spark.apache.org by redocpot <ju...@gmail.com> on 2014/01/07 12:08:50 UTC

split an RDD by percentage

Hi,

I want to split an RDD by a certain percentage, for example 10% (split the RDD
into 10 pieces).

Ideally, the function I am looking for would be something like this:

def deterministicSplit[T](dataSet: RDD[T], nb: Int): Array[RDD[T]] = {
    /* code */
}

Here, "dataSet" is an RDD sorted by its key. For example, if "nb" = 10, the
function returns an array containing the first 10% of the data, the second 10%,
..., and the tenth 10%.

In fact, I can do this in an ugly way, but I would prefer to do it properly. Any
hints? Please share your ideas.

I have looked through the RDD API, but could not find what I want.
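
(To give a rough idea of the behaviour I am after, here is an untested sketch,
assuming RDD.zipWithIndex is available; I am sure there is a cleaner or more
standard way, which is exactly what I am asking about:)

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Untested sketch: cut a sorted RDD into `nb` consecutive, roughly equal slices
// by position. The slice arithmetic is only for illustration.
def deterministicSplit[T: ClassTag](dataSet: RDD[T], nb: Int): Array[RDD[T]] = {
  val total = dataSet.count()                               // one extra pass to get the size
  val sliceSize = math.max(1L, math.ceil(total.toDouble / nb).toLong)
  val indexed = dataSet.zipWithIndex()                      // (element, position) pairs
  (0 until nb).toArray.map { slice =>
    indexed
      .filter { case (_, idx) => idx / sliceSize == slice } // keep the slice-th block
      .map(_._1)
  }
}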

Thank you.

Hao




Re: split an RDD by percentage

Posted by "pankaj.arora" <pa...@gmail.com>.
You can use mapPartitions to achieve this.

import scala.collection.mutable.ArrayBuffer

// Split each partition into 10 equal parts, tagging each part with its id.
// (`self` is the RDD[T] to split, e.g. the sorted dataSet from the question.)
val splitRDD = self.mapPartitions { itr =>
  // Iterate over this iterator and distribute its elements round-robin into 10 buckets.
  val buckets = Array.fill(10)(ArrayBuffer[T]())
  var i = 0
  for (tuple <- itr) {
    buckets(i % 10) += tuple
    i += 1
  }
  // Emit one (part id, bucket) pair per part for this partition.
  buckets.zipWithIndex.map { case (bucket, id) => (id, bucket) }.iterator
}

// Filter the tagged RDD for each part built above and flatMap to get an array of RDDs.
val rddArray = (0 until 10).toArray.map(i => splitRDD.filter(_._1 == i).flatMap(_._2))

I did not write this code in an IDE, so it may need small modifications to compile.
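
For example (untested), you could sanity-check that each of the ten parts holds
roughly a tenth of the data:

// Quick check: print the size of each part. count() triggers a pass per part,
// so caching splitRDD first is worthwhile if you do this.
splitRDD.cache()
rddArray.zipWithIndex.foreach { case (part, id) =>
  println(s"part $id: ${part.count()} elements")
}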



