Posted to user@spark.apache.org by Meisam Fathi <me...@gmail.com> on 2013/11/11 22:35:55 UTC

How to aggregate data by key

Hi,

I'm trying to use Spark to aggregate data.

I am doing something similar to this right now.

    val groupByRdd = rdd.groupBy(x => x._1)
    val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))

This works fine for smaller datasets but fails with an OOM for larger ones
(the groupBy operation runs out of memory).

I know I could use RDD.aggregate() if I wanted to aggregate all the
data for all keys. Is there any way to use something similar to
RDD.aggregate()? I'm looking for an operation like RDD.reduceByKey()
or RDD.aggregateByKey() in Spark. Is there one already implemented or
should I write my own?

Thanks,
Meisam

Re: How to aggregate data by key

Posted by Meisam Fathi <me...@gmail.com>.
Thanks, Josh.

On Mon, Nov 11, 2013 at 4:50 PM, Josh Rosen <ro...@gmail.com> wrote:
> For RDDs of pairs, Spark has operations like reduceByKey() and
> combineByKey(). The Quick Start guide features a word count example that
> uses reduceByKey():
> https://spark.incubator.apache.org/docs/latest/quick-start.html#more-on-rdd-operations
>
> These operations are defined in a class called PairRDDFunctions
> (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions).
> SparkContext provides implicit conversions to let you write
> rdd.reduceByKey(...) without having to manually wrap your RDD into a
> PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your
> imports.
>
>
>
> On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi <me...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I'm trying to use Spark to aggregate data.
>>
>> I am doing something similar to this right now.
>>
>>     val groupByRdd = rdd.groupBy(x => x._1)
>>     val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))
>>
>> This works fine for smaller datasets but fails with an OOM for larger ones
>> (the groupBy operation runs out of memory).
>>
>> I know I could use RDD.aggregate() if I wanted to aggregate all the
>> data for all keys. Is there anyway to use something similar to
>> RDD.aggregate()? I'm looking for an operation like RDD.reduceByKey()
>> or RDD.aggregateByKey() in Spark. Is there one already implemented or
>> should I write my own?
>>
>> Thanks,
>> Meisam
>
>

Re: How to aggregate data by key

Posted by Josh Rosen <ro...@gmail.com>.
For RDDs of pairs, Spark has operations like reduceByKey() and
combineByKey(). The Quick Start guide features a word count example that
uses reduceByKey():
https://spark.incubator.apache.org/docs/latest/quick-start.html#more-on-rdd-operations

These operations are defined in a class called PairRDDFunctions (
https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions).
 SparkContext provides implicit conversions to let you write
rdd.reduceByKey(...) without having to manually wrap your RDD into a
PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your
imports.
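
For reference, a minimal, self-contained sketch of the approach described above; the local master, app name, sample data, and the combineByKey average-style aggregate are illustrative assumptions, not part of the original thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

    object AggregateByKeyExample {
      def main(args: Array[String]): Unit = {
        // Hypothetical local context and (key, value) data standing in for the original rdd.
        val sc = new SparkContext("local", "AggregateByKeyExample")
        val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

        // reduceByKey combines values map-side before the shuffle, so the whole
        // group for a key never has to be materialized in memory (unlike groupBy).
        val sums = rdd.reduceByKey(_ + _)
        sums.collect().foreach(println)  // (a,4), (b,6)

        // combineByKey allows an aggregate whose type differs from the value type,
        // e.g. a (sum, count) pair that can later yield a per-key average.
        val sumAndCount = rdd.combineByKey(
          (v: Int) => (v, 1),                                          // createCombiner
          (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
          (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
        )
        sumAndCount.collect().foreach(println)  // (a,(4,2)), (b,(6,2))

        sc.stop()
      }
    }

Both operations pre-aggregate within each partition before shuffling, which is what avoids the memory blowup of groupBy followed by a per-group sum.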



On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi <me...@gmail.com> wrote:

> Hi,
>
> I'm trying to use Spark to aggregate data.
>
> I am doing something similar to this right now.
>
>     val groupByRdd = rdd.groupBy(x => x._1)
>     val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))
>
> This works fine for smaller datasets but fails with an OOM for larger ones
> (the groupBy operation runs out of memory).
>
> I know I could use RDD.aggregate() if I wanted to aggregate all the
> data for all keys. Is there any way to use something similar to
> RDD.aggregate()? I'm looking for an operation like RDD.reduceByKey()
> or RDD.aggregateByKey() in Spark. Is there one already implemented or
> should I write my own?
>
> Thanks,
> Meisam
>