Posted to user@spark.apache.org by Anand Nalya <an...@gmail.com> on 2015/07/06 09:32:57 UTC

Split RDD into two in a single pass

Hi,

I have an RDD that I want to split into two disjoint RDDs using a boolean
function. I can do this with the following:

val rdd1 = rdd.filter(f)
val rdd2 = rdd.filter(fnot) // fnot(x) == !f(x)

I'm assuming that each of the above statements will traverse the RDD once,
resulting in two passes.

Is there a way of doing this in a single pass over the RDD, so that when f
returns true the element goes to rdd1, and to rdd2 otherwise?

Regards,
Anand
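
[Editor's note: on an in-memory Scala collection, the one-pass split asked
for above is exactly what `partition` does. RDD has no such method, but this
sketch, with a made-up example predicate, shows the desired semantics:]

```scala
// One-pass split of a plain Scala collection. RDD offers no equivalent,
// so this only illustrates the behavior the question asks for.
val data = Seq(1, 2, 3, 4, 5, 6)
val f: Int => Boolean = _ % 2 == 0           // example predicate
val (rdd1Like, rdd2Like) = data.partition(f) // single traversal
// rdd1Like == Seq(2, 4, 6), rdd2Like == Seq(1, 3, 5)
```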

Re: Split RDD into two in a single pass

Posted by Daniel Darabos <da...@lynxanalytics.com>.
This comes up so often. I wonder if the documentation or the API could be
changed to answer this question.

The solution I found is from
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job.
You basically write the items into two directories in a single pass through
the RDD. Then you read back the two directories as two RDDs.

It avoids traversing the RDD twice, but writing to and reading from the file
system is also costly, so it may not always be worth it.

