Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/01/21 06:42:03 UTC

How to perform multi dimensional reduction in spark?

Hi,

It seems Spark does not support nested RDDs, so I was wondering how
Spark can handle multi-dimensional reductions.

As an example consider a dataset with these rows:

((i, j), value)

where i and j are long indexes, and value is a double.

How is it possible to first reduce the above RDD over j, and then reduce
the results over i?

Just to clarify, a scala equivalent would look like this:

var results = 1.0
for (i <- 0 until I) {
  var jReduction = 0.0
  for (j <- 0 until J) {
    // Reduce over j
    jReduction = jReduction + rdd(i, j)
  }
  // Reduce over i
  results = results * jReduction
}

Re: How to perform multi dimensional reduction in spark?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
You can break this down into a two-stage pipeline like the following:

rdd.map(r => (r._1._1, r._2)).reduceByKey(_ + _).map(_._2).reduce(_ * _)

The first map puts each (i,j) pair into an i group, and then the
reduceByKey will sum up all the values in each i group.

The second map extracts the value of each i group, and the final reduce
multiplies all of those values together. There are optimizations that can be
made here if your data gets really big or really dense, but that's the basic
idea.
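To make the data flow concrete, here is a plain-Python sketch of the same two stages on a tiny hypothetical dataset (the sample values are made up for illustration; the dict plays the role of reduceByKey, no Spark required):

```python
from functools import reduce
from collections import defaultdict
from operator import mul

# Hypothetical sample rows in the thread's ((i, j), value) shape.
rows = [((0, 0), 1.0), ((0, 1), 2.0),
        ((1, 0), 3.0), ((1, 1), 4.0)]

# Stage 1: map ((i, j), v) -> (i, v), then reduce by key, summing per i.
sums_by_i = defaultdict(float)
for (i, _j), v in rows:
    sums_by_i[i] += v

# Stage 2: drop the keys and multiply the per-i sums together.
result = reduce(mul, sums_by_i.values())
print(result)  # (1 + 2) * (3 + 4) = 21.0
```

This mirrors the nested-loop pseudocode above: the inner loop is the per-key sum, the outer loop is the product over keys.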

- Evan


On Tue, Jan 21, 2014 at 11:29 AM, Aureliano Buendia <bu...@gmail.com> wrote:


Re: How to perform multi dimensional reduction in spark?

Posted by Aureliano Buendia <bu...@gmail.com>.
Surprisingly, this turned out to be more complicated than what I expected.

I had the impression that this would be trivial in spark. Am I missing
something here?


On Tue, Jan 21, 2014 at 5:42 AM, Aureliano Buendia <bu...@gmail.com> wrote:
