Posted to issues@spark.apache.org by "Timothy Hunter (JIRA)" <ji...@apache.org> on 2016/06/28 17:01:01 UTC
[jira] [Created] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
Timothy Hunter created SPARK-16258:
--------------------------------------
Summary: Automatically append the grouping keys in SparkR's gapply
Key: SPARK-16258
URL: https://issues.apache.org/jira/browse/SPARK-16258
Project: Spark
Issue Type: Improvement
Components: SparkR
Reporter: Timothy Hunter
While working on the group apply function for Python [1], we found it easier to depart from SparkR's gapply function in the following ways:
- the grouping keys are appended by default to the Spark DataFrame being returned
- the output schema that the user provides is the schema of the R data frame and does not include the keys
Here are the reasons for doing so:
- in most cases, users will want to know the key associated with a result, so appending the key is the sensible default
- most functions in the SQL interface and in MLlib append columns, and gapply currently departs from this philosophy
- for the cases when users do not need the key, appending it costs only a small fraction of the computation time and of the output size
- from a formal perspective, it makes calling gapply fully transparent to the type of the key: it is easier to build a function for gapply because the function does not need to know anything about the key
This ticket proposes to change SparkR's gapply function to follow the same convention as Python's implementation.
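To make the proposed convention concrete, here is a minimal sketch in plain Python (no Spark dependency; all names are hypothetical and for illustration only): the user's function sees only a group's rows and returns rows without the key, and the framework appends the grouping key to every output row.

```python
from itertools import groupby
from operator import itemgetter

def group_apply(rows, key_col, func):
    """Apply `func` to each group of `rows` sharing `key_col`.

    `func` receives the group's rows and returns result rows that do NOT
    contain the key; the key column is appended automatically, mirroring
    the convention proposed in this ticket.
    """
    rows = sorted(rows, key=itemgetter(key_col))
    out = []
    for key, group in groupby(rows, key=itemgetter(key_col)):
        for result_row in func(list(group)):
            # The grouping key is appended here; `func` never handles it.
            result_row[key_col] = key
            out.append(result_row)
    return out

data = [
    {"dept": "a", "salary": 100},
    {"dept": "a", "salary": 300},
    {"dept": "b", "salary": 500},
]

def mean_salary(group):
    # Returns only the computed columns -- no grouping key needed,
    # and the declared output schema would not include the key either.
    return [{"mean_salary": sum(r["salary"] for r in group) / len(group)}]

print(group_apply(data, "dept", mean_salary))
# [{'mean_salary': 200.0, 'dept': 'a'}, {'mean_salary': 500.0, 'dept': 'b'}]
```

Note how `mean_salary` is fully independent of the key's name and type, which is the transparency argument above.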
cc [~Narine] [~shivaram]
[1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org