Posted to user@spark.apache.org by Stephen Fletcher <st...@gmail.com> on 2017/04/07 14:26:01 UTC

reducebykey

Are there plans to add reduceByKey to DataFrames? Since switching over to
Spark 2 I find myself increasingly dissatisfied with the idea of converting
DataFrames to RDDs to do procedural programming on grouped data (both from
an ease-of-programming stance and a performance stance). So I've been using
DataFrame's experimental groupByKey and flatMapGroups, which perform
extremely well, I'm guessing because of the encoders, but the amount of
data being transferred is a little excessive. Are there any plans to port
reduceByKey (and additionally a reduceByKeyleft and right)?

Re: reducebykey

Posted by Ankur Srivastava <an...@gmail.com>.
Hi Stephen,

If you use aggregate functions or reduceGroups on a KeyValueGroupedDataset,
it behaves like reduceByKey on an RDD.

Only flatMapGroups and mapGroups behave like groupByKey on an RDD, and the
API documentation warns about using them.
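
A minimal sketch of the difference, assuming Spark 2.x with a Scala Dataset.
The Sale case class, the ReduceGroupsSketch object and its method names are
made up for illustration; only groupByKey, reduceGroups, mapGroups, groupBy
and agg are Spark API calls. All three methods compute the same per-key
totals; what differs is how much data has to be shuffled to get there.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.sum

// Hypothetical record type, used only for this illustration.
case class Sale(key: String, amount: Long)

object ReduceGroupsSketch {

  // reduceGroups takes a binary, associative function, so values for a key
  // can be combined pairwise (the reduceByKey-like path described above).
  def viaReduceGroups(sales: Dataset[Sale]): Dataset[Sale] = {
    import sales.sparkSession.implicits._
    sales
      .groupByKey(_.key)
      .reduceGroups((x: Sale, y: Sale) => Sale(x.key, x.amount + y.amount))
      .map(_._2) // reduceGroups yields (key, reducedValue) pairs
  }

  // mapGroups hands the function an iterator over every row of a key, so
  // all rows for that key must be shuffled to one place first
  // (the groupByKey-like path).
  def viaMapGroups(sales: Dataset[Sale]): Dataset[Sale] = {
    import sales.sparkSession.implicits._
    sales
      .groupByKey(_.key)
      .mapGroups((key: String, rows: Iterator[Sale]) =>
        Sale(key, rows.map(_.amount).sum))
  }

  // Untyped alternative: a built-in aggregate function on a grouped DataFrame.
  def viaAggregateFunction(sales: Dataset[Sale]): DataFrame =
    sales.groupBy("key").agg(sum("amount").as("amount"))
}
```

Comparing the physical plans with explain() on each result is one way to
check how a given Spark version actually executes them.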

Hope this helps.

Thanks
Ankur
