Posted to user@spark.apache.org by Johan Stenberg <jo...@gmail.com> on 2014/09/26 17:55:24 UTC

How to do operations on multiple RDD's

Hi,

This is my first post to the mailing list, so give me some feedback if I do
something wrong.

To perform an operation on two RDDs and produce a new one you can just use
zipPartitions, but if I have an arbitrary number of RDDs that I would like
to combine into a single RDD, how do I do that? I've been reading the docs
but haven't found anything.

For example: if I have a Seq of RDD[Array[Int]]s and I want to take the
majority value of each array cell. So if the RDDs each hold one array, like
this:

[1, 2, 3]
[0, 0, 0]
[1, 2, 0]

Then the resulting RDD would hold the array [1, 2, 0]. How do I approach
this problem? An accumulator variable would probably become too heavy, I
guess? Otherwise I could use an array of maps, with cell values as keys and
their frequencies as values.
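To illustrate, here is what I mean by the frequency-map idea, sketched on
plain Scala collections for a moment (one Array[Int] per RDD; no Spark
involved, and the helper name is just made up):

```scala
// Majority vote per array cell: each cell becomes a Map(value -> count),
// the maps are merged cell-by-cell, and the most frequent value wins.
def majority(rows: Seq[Array[Int]]): Array[Int] =
  rows
    .map(_.map(v => Map(v -> 1)))                  // one frequency map per cell
    .reduce { (a, b) =>
      a.zip(b).map { case (m1, m2) =>              // merge the per-cell maps
        m2.foldLeft(m1) { case (m, (k, c)) =>
          m.updated(k, m.getOrElse(k, 0) + c)
        }
      }
    }
    .map(_.maxBy(_._2)._1)                         // pick the most frequent value

val result = majority(Seq(Array(1, 2, 3), Array(0, 0, 0), Array(1, 2, 0)))
// result is Array(1, 2, 0)
```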

Essentially I want something like zipPartitions but for arbitrarily many
RDDs. Is there any such functionality, or how would I approach this problem?

Cheers,

Johan

Re: How to do operations on multiple RDD's

Posted by Daniel Siegmann <da...@velos.io>.
There are numerous ways to combine RDDs. In your case, it seems you have
several RDDs of the same type and you want to do an operation across all of
them as if they were a single RDD. The way to do this is SparkContext.union
or RDD.union, both of which have minimal overhead. The only difference is
that RDD.union combines just two at a time (but of course you can call
reduce on your sequence to union them all).
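A minimal sketch of both forms (assuming a local master just for
illustration; the RDD contents here are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("union-sketch"))

val rdds = Seq(
  sc.parallelize(Seq(1, 2, 3)),
  sc.parallelize(Seq(4, 5)),
  sc.parallelize(Seq(6))
)

// SparkContext.union takes the whole sequence at once...
val all1 = sc.union(rdds)
// ...while RDD.union combines two, so reduce over the sequence:
val all2 = rdds.reduce(_ union _)

val res1 = all1.collect().sorted
val res2 = all2.collect().sorted
sc.stop()
```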

Keep in mind this won't repartition anything, so if you find you have too
many partitions after the union you could use RDD.coalesce to merge them.
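For example (again on a local master; the partition counts are just chosen
to make the effect visible):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))

// Three RDDs with 4 partitions each: the union keeps all 12 partitions.
val rdds = Seq.fill(3)(sc.parallelize(1 to 10, numSlices = 4))
val unioned = sc.union(rdds)
val before = unioned.partitions.length   // 12

// coalesce merges partitions without a full shuffle.
val merged = unioned.coalesce(4)
val after = merged.partitions.length     // 4

val data = merged.collect().sorted       // same data, fewer partitions
sc.stop()
```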

On Fri, Sep 26, 2014 at 11:55 AM, Johan Stenberg <jo...@gmail.com>
wrote:

> Hi,
>
> This is my first post to the email list so give me some feedback if I do
> something wrong.
>
> To do operations on two RDD's to produce a new one you can just use
> zipPartitions, but if I have an arbitrary number of RDD's that I would like
> to perform an operation on to produce a single RDD, how do I do that? I've
> been reading the docs but haven't found anything.
>
> For example: if I have a Seq of RDD[Array[Int]]'s and I want to take the
> majority of each array cell. So if all RDD's have one array which are like
> this:
>
> [1, 2, 3]
> [0, 0, 0]
> [1, 2, 0]
>
> Then the resulting RDD would have the array [1, 2, 0]. How do I approach
> this problem? It becomes too heavy to have an accumulator variable I guess?
> Otherwise it could be an array of maps with values as keys and frequency as
> values.
>
> Essentially I want something like zipPartitions but for arbitrarily many
> RDD's, is there any such functionality or how would I approach this problem?
>
> Cheers,
>
> Johan
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io