Posted to issues@spark.apache.org by "Daniel Shields (JIRA)" <ji...@apache.org> on 2016/09/06 19:01:20 UTC

[jira] [Updated] (SPARK-17416) Add Dataset.groupByKey overload that takes a value selector function

     [ https://issues.apache.org/jira/browse/SPARK-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Shields updated SPARK-17416:
-----------------------------------
    Description: 
I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], arg1: Encoder[V]): KeyValueGroupedDataset[K, V]
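
For reference, here is a rough user-land sketch of how the overload could be approximated on top of the existing API. The object and value names below are illustrative, and the sketch leans on KeyValueGroupedDataset.mapValues, which is not part of the 2.0.0 API:

import org.apache.spark.sql.{Dataset, Encoder, Encoders, KeyValueGroupedDataset}

object GroupByKeyImplicits {
  // Illustrative extension method, not the proposed implementation itself.
  implicit class DatasetGroupByKeyOps[T](ds: Dataset[T]) {
    def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)
        (implicit ke: Encoder[K], ve: Encoder[V]): KeyValueGroupedDataset[K, V] = {
      // Encoder for the intermediate (key, value) pairs.
      implicit val kvEncoder: Encoder[(K, V)] = Encoders.tuple(ke, ve)
      ds.map(t => (keyFunc(t), valueFunc(t))) // Dataset[(K, V)]
        .groupByKey(_._1)                     // KeyValueGroupedDataset[K, (K, V)]
        .mapValues(_._2)                      // KeyValueGroupedDataset[K, V]
    }
  }
}

After import GroupByKeyImplicits._, the two-argument call resolves to this extension method, since Dataset itself only defines the single-argument form.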

This would simplify a number of use cases.  For example, consider the following classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples
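
As a concrete instance of this pattern, a word count could look like the following (the input name lines is illustrative):

lines.flatMap(line => line.split("\\s+").map(word => (word, 1L))).reduceByKey(_ + _)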


An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))
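
Applied to the word count above, the two forms would read as follows (again illustrative, assuming pairs is the Dataset[(String, Long)] produced by the flatMap):

pairs.groupByKey(_._1, _._2).reduceGroups(_ + _)                   // with the proposed overload
pairs.groupByKey(_._1).reduceGroups((a, b) => (a._1, a._2 + b._2)) // today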

  was:
I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], implicit arg1: Encoder[V]): KeyValueGroupedDataset[K, V]

This would simplify a number of use cases.  For example, consider the following classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples


An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))


> Add Dataset.groupByKey overload that takes a value selector function
> --------------------------------------------------------------------
>
>                 Key: SPARK-17416
>                 URL: https://issues.apache.org/jira/browse/SPARK-17416
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Daniel Shields
>
> I propose that the following overload be added to Dataset[T]:
> def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], arg1: Encoder[V]): KeyValueGroupedDataset[K, V]
> This would simplify a number of use cases.  For example, consider the following classic MapReduce query:
> rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples
> An idiomatic way to write this with Spark 2.0 would be:
> dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)
> Without the groupByKey overload suggested above, this must be written as:
> dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org