Posted to user@spark.apache.org by David Rowe <da...@gmail.com> on 2014/09/29 08:59:35 UTC

aggregateByKey vs combineByKey

Hi All,

After some hair pulling, I've reached the realisation that an operation I
am currently doing via:

myRDD.groupByKey.mapValues(func)

should be done more efficiently using aggregateByKey or combineByKey. Either
of these methods would do, and they seem very similar to me in terms of
their function.

My question is, what are the differences between these two methods (other
than the slight differences in their type signatures)? Under what
circumstances should I use one or the other?

Thanks

Dave

Re: aggregateByKey vs combineByKey

Posted by David Rowe <da...@gmail.com>.
Thanks Liquan, that was really helpful.


Re: aggregateByKey vs combineByKey

Posted by Liquan Pei <li...@gmail.com>.
Hi Dave,

You can replace groupByKey with reduceByKey to improve performance in some
cases. reduceByKey performs a map-side combine, which reduces network I/O
and shuffle size, whereas groupByKey performs no map-side combine at all.
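
For example, a minimal sketch (not from the original thread; myRDD, sc,
and the per-key sum are hypothetical stand-ins):

  import org.apache.spark.rdd.RDD

  // Hypothetical setup: sc is an existing SparkContext, and the values
  // are counts keyed by word.
  val myRDD: RDD[(String, Int)] =
    sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // groupByKey shuffles every value across the network and only then
  // sums on the reduce side.
  val viaGroup = myRDD.groupByKey().mapValues(_.sum)

  // reduceByKey combines within each partition first (map-side combine),
  // so only one partial sum per key per partition is shuffled.
  val viaReduce = myRDD.reduceByKey(_ + _)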

combineByKey is more general than aggregateByKey. In fact, aggregateByKey,
reduceByKey, and groupByKey are all implemented in terms of combineByKey.
aggregateByKey is similar to reduceByKey, but you provide an initial (zero)
value, and the aggregated type can differ from the value type.

As the name suggests, aggregateByKey is suitable for computing per-key
aggregations, for example sums and averages. The rule of thumb is that the
extra computation spent on the map-side combine should reduce the amount of
data sent to other nodes and to the driver. If your function satisfies this
rule, you probably should use aggregateByKey.
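
As an illustration, here is a sketch of a per-key average, reusing the
hypothetical myRDD: RDD[(String, Int)] from the sketch above; the
(sum, count) accumulator is my choice of example, not something from the
thread:

  // Per-key average with aggregateByKey: the (sum, count) accumulator
  // has a different type from the Int values, which reduceByKey alone
  // could not express.
  val avg = myRDD
    .aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),  // seqOp: fold in a value (map side)
      (a, b) => (a._1 + b._1, a._2 + b._2)    // combOp: merge partial results
    )
    .mapValues { case (sum, count) => sum.toDouble / count }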

combineByKey is the most general of these, and gives you the flexibility to
specify whether a map-side combine should be performed (via the
mapSideCombine parameter on one of its overloads). However, it is more
complex to use: at a minimum, you need to implement three functions:
createCombiner, mergeValue, and mergeCombiners.

Hope this helps!
Liquan


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst