Posted to user@spark.apache.org by Ian Ferreira <ia...@hotmail.com> on 2014/04/19 06:59:54 UTC

Combining RDD's columns

This may seem contrived, but suppose I wanted to create a collection of
"single-column" RDDs that contain calculated values, and I want to cache
these to avoid recalculating them.

i.e.

rdd1 = {Names}
rdd2 = {Star Sign}
rdd3 = {Age}

Then I want to create a new virtual RDD that combines several of these
RDDs into a "multi-column" RDD:

rddA = {Names, Age}
rddB = {Names, Star Sign}

I saw that rdd.union() merges rows, but is there anything that can combine columns?

Cheers
- Ian

Re: Combining RDD's columns

Posted by Jeremy Freeman <fr...@gmail.com>.
Hi Ian,

If I understand what you're after, you might find "zip" useful. From the docs:

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the other).

Here's a toy example:

>> val rdd1 = sc.parallelize(Array("name1", "name2", "name3"), 3)
>> val rdd2 = sc.parallelize(Array("sign1", "sign2", "sign3"), 3)
>> rdd1.collect()
Array[String] = Array(name1, name2, name3)
>> rdd2.collect()
Array[String] = Array(sign1, sign2, sign3)
>> rdd1.zip(rdd2).collect()
Array[(String, String)] = Array((name1,sign1), (name2,sign2), (name3,sign3))

In your case, you might have the first two RDDs calculated from some common raw data through a map.
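
To make that concrete, here is a minimal sketch, assuming the raw data can be modelled as one RDD of records. The Person case class, the sample values, and the ZipColumns object are hypothetical names made up for illustration; only parallelize, map, cache, zip and collect are Spark calls. Because every column is a map over the same parent, the columns share its partitioning and zip's assumptions should hold.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipColumns {
  // Hypothetical raw record; the real source data is not shown in this thread.
  case class Person(name: String, sign: String, age: Int)

  // Builds rddA = {Names, Age} and rddB = {Names, Star Sign} from one parent RDD.
  def build(sc: SparkContext): (RDD[(String, Int)], RDD[(String, String)]) = {
    val raw = sc.parallelize(Seq(
      Person("name1", "sign1", 31),
      Person("name2", "sign2", 25),
      Person("name3", "sign3", 47)), 3)

    // Each "column" is a map over the same parent, so all three keep its
    // partition count and per-partition element counts, which zip relies on.
    // cache() marks each column for reuse so it is not recomputed.
    val names = raw.map(_.name).cache()
    val signs = raw.map(_.sign).cache()
    val ages  = raw.map(_.age).cache()

    (names.zip(ages), names.zip(signs))
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("zip-columns").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (rddA, rddB) = build(sc)
      println(rddA.collect().mkString(", ")) // (name1,31), (name2,25), (name3,47)
      println(rddB.collect().mkString(", ")) // (name1,sign1), (name2,sign2), (name3,sign3)
    } finally {
      sc.stop()
    }
  }
}
```

If the columns do not come from a common parent, their partition sizes may differ and zip can fail at runtime. In that case, keying each RDD by position (for example with zipWithIndex, where your Spark version provides it) and joining on that key is one possible alternative, at the cost of a shuffle.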

-- Jeremy

---------------------
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab
