Posted to issues@spark.apache.org by "Timothy Hunter (JIRA)" <ji...@apache.org> on 2016/06/28 17:01:01 UTC
[jira] [Created] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
Timothy Hunter created SPARK-16258:
--------------------------------------
Summary: Automatically append the grouping keys in SparkR's gapply
Key: SPARK-16258
URL: https://issues.apache.org/jira/browse/SPARK-16258
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Timothy Hunter
While working on the group apply function for Python [1], we found it easier to depart from SparkR's gapply function in the following ways:
- the grouping keys are appended by default to the Spark DataFrame being returned
- the output schema that the user provides is the schema of the R data frame and does not include the keys
Here are the reasons for doing so:
- in most cases, users will want to know the key associated with a result, so appending the key is the sensible default
- most functions in the SQL interface and in MLlib append columns, and gapply currently departs from this philosophy
- for the cases when users do not need the key, appending it costs only a small fraction of the computation time and of the output size
- from a formal perspective, it makes calling gapply fully transparent to the type of the key: it is easier to build a function for gapply because the function does not need to know anything about the key
This ticket proposes to change SparkR's gapply function to follow the same convention as Python's implementation.
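To make the proposed convention concrete, here is a minimal sketch in plain Python (no Spark dependency; all names are hypothetical and for illustration only): the user's function sees only a group's rows and returns rows without the key, and the framework appends the grouping key to every output row.

```python
from itertools import groupby
from operator import itemgetter

def group_apply(rows, key_col, func):
    """Apply `func` to each group of `rows` sharing `key_col`.

    `func` receives the group's rows and returns result rows that do NOT
    contain the key; the key column is appended automatically, mirroring
    the convention proposed in this ticket.
    """
    rows = sorted(rows, key=itemgetter(key_col))
    out = []
    for key, group in groupby(rows, key=itemgetter(key_col)):
        for result_row in func(list(group)):
            # The grouping key is appended here; `func` never handles it.
            result_row[key_col] = key
            out.append(result_row)
    return out

data = [
    {"dept": "a", "salary": 100},
    {"dept": "a", "salary": 300},
    {"dept": "b", "salary": 500},
]

def mean_salary(group):
    # Returns only the computed columns -- no grouping key needed,
    # and the declared output schema would not include the key either.
    return [{"mean_salary": sum(r["salary"] for r in group) / len(group)}]

print(group_apply(data, "dept", mean_salary))
# [{'mean_salary': 200.0, 'dept': 'a'}, {'mean_salary': 500.0, 'dept': 'b'}]
```

Note how `mean_salary` is fully independent of the key's name and type, which is the transparency argument above.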
cc [~Narine] [~shivaram]
[1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org