Posted to reviews@spark.apache.org by NarineK <gi...@git.apache.org> on 2016/07/07 13:31:24 UTC
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
GitHub user NarineK opened a pull request:
https://github.com/apache/spark/pull/14090
[SPARK-16112][SparkR] Programming guide for gapply/gapplyCollect
## What changes were proposed in this pull request?
Updates programming guide for spark.gapply/spark.gapplyCollect.
Similar to the other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.
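For context, a minimal sketch of the kind of `faithful`-based `gapply` example the guide adds (a sketch only, assuming a running SparkR session started with `sparkR.session()`; the `max_eruption` column name is illustrative, not taken from the patch):

```r
# Sketch: compute the maximum eruption duration per waiting time
# using gapply over the built-in faithful dataset.
df <- createDataFrame(faithful)

# Schema of the R function's output: grouping key plus the aggregate.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

result <- gapply(
  df,
  "waiting",                       # grouping column(s)
  function(key, x) {
    # key: the grouping key; x: a local data.frame for this group
    data.frame(key, max(x$eruptions))
  },
  schema)

head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```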
## How was this patch tested?
Existing test cases in R
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NarineK/spark gapplyProgGuide
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14090.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14090
----
commit 29d8a5c6c22202cdf7d6cc44f1d6cbeca5946918
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-06-20T22:12:11Z
Fixed duplicated documentation problem + separated documentation for dapply and dapplyCollect
commit 698c4331d2a8bfe7f4b372ebc8123b6c27a57e68
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-06-23T18:51:48Z
merge with master
commit 85a4493a03b3601a93c25ebc1eafb2868efec8d8
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-07-07T13:18:49Z
Adding programming guide for gapply/gapplyCollect
commit 7781d1c111f38e3608d5ebd468e6d344d52efa5c
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-07-07T13:27:35Z
removing output format
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62147/consoleFull)** for PR 14090 at commit [`2af7243`](https://github.com/apache/spark/commit/2af724321e0d51aed64c84dd22741a7cc6067caf).
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Build finished. Test PASSed.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202321
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Thanks @shivaram.
Does the following mapping look fine to have in the table?
```
**R Spark**
byte byte
integer integer
float float
double double
numeric double
character string
string string
binary binary
raw binary
logical boolean
timestamp timestamp
date date
array array
map map
struct struct
```
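For what it's worth, the right-hand (Spark) names in such a table are the type strings that `structField` accepts when declaring a UDF's output schema, e.g. (a sketch; requires the SparkR package loaded, and the field names are hypothetical):

```r
# Sketch: Spark-side type names used as structField type strings.
schema <- structType(
  structField("id",    "integer"),
  structField("score", "double"),
  structField("name",  "string"),
  structField("flag",  "boolean"))
```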
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
Merging this to master, branch-2.0
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70922863
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>timestamp</td>
+ <td>timestamp</td>
+</tr>
+<tr>
+ <td>date</td>
+ <td>date</td>
+</tr>
+<tr>
+ <td>array</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>list</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>map</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>env</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>struct</td>
--- End diff --
And as you mentioned above we can also change `date` to `Date` to be more specific. (It would be ideal, now that I think of it, to link these R types to the CRAN help pages. For example we can link to https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html for `Date` and https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html for `POSIXct / POSIXlt`.)
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks @NarineK - I tried it on a fresh Ubuntu VM and the rendered docs looked fine there. I think the earlier rendering issue has something to do with ruby / jekyll versions.
LGTM. @felixcheung Could you also take one final look ?
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70711116
--- Diff: docs/sparkr.md ---
@@ -263,7 +263,7 @@ In SparkR, we support several kinds of User-Defined Functions:
##### dapply
Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
-should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match the R function's output.
+should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match to [data types of R function's output fields](#data-type-mapping-between-r-and-spark).
--- End diff --
`output fields` --> `return values` or `return value`?
http://adv-r.had.co.nz/Functions.html#return-values
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70168781
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Thanks @felixcheung, does this sound better?
"It must reflect R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user." I could also bring up some examples.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70920785
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
That's a good point. So users can create a schema with `struct`, and that maps to a corresponding SQL type, but they can't create any R objects that will be parsed as `struct`. The main reason our schema is more flexible than our serialization / deserialization support is that the schema can be used to, say, read JSON files or JDBC tables etc.
For the use case here, where users are returning a `data.frame` from a UDF, I don't think there is any valid mapping for `struct` from R.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70346974
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I think those mappings are only used to print things in `str`. A better list to consult would be the one at https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L23 -- as that says, `list` in R should become an `array` in SparkSQL, and `env` in R should map to a `map`.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70905195
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
I don't think `date` is a type either.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70920244
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
@felixcheung, I think according to the following mapping we expect 'date':
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
And it seems that there is a `Date` class in base R. Do I understand correctly?
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62411/consoleFull)** for PR 14090 at commit [`f584416`](https://github.com/apache/spark/commit/f584416b81bc19d951d28eb2861cc3a4a16bc117).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62300/consoleFull)** for PR 14090 at commit [`5d34943`](https://github.com/apache/spark/commit/5d3494337ed2dfc5592b11e324aa7ef52a6f354e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/14090
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks @shivaram, @felixcheung for the comments. I'll address those today.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
@felixcheung Could you take one more look at this ?
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70846132
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
I don't think R has any notion of a `struct` or `map` data type. Looking at the list of R data structures at http://adv-r.had.co.nz/Data-structures.html I think we should remove the struct -> struct and map -> map entries. Also, I don't think there is a `timestamp` class in R; we should probably replace that with `POSIXct` or `POSIXlt`.
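A quick base-R check supports this (a sketch run in plain R; it shows the class names R actually reports, which differ from several R-side entries in the proposed table):

```r
# What base R actually reports for the classes discussed above:
class(as.raw(1))   # "raw"               -> Spark binary
class(TRUE)        # "logical"           -> Spark boolean
class(Sys.time())  # "POSIXct" "POSIXt"  -> Spark timestamp
class(Sys.Date())  # "Date"              -> Spark date
class(list(1, 2))  # "list"              -> Spark array
class(new.env())   # "environment"       -> Spark map
```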
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62147/
Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62299/consoleFull)** for PR 14090 at commit [`8a2aff3`](https://github.com/apache/spark/commit/8a2aff3add082e20c45136dc5814e6ccdf4b256c).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70172206
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I think gapply and dapply are the first important use cases where we require a strict mapping of Spark JVM types to R atomic types. It might be worthwhile to add a section in the programming guide to illustrate and explain that further.
To be more concrete, what should be the column type of the UDF output R data.frame if the SparkDataFrame has a column of double? It would be good to have a table on that.
That could be a separate PR though.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61911/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70711263
--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
that key. The groups are chosen from `SparkDataFrame`s column(s).
The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user. Bellow data type mapping between R
--- End diff --
same, `output field` here
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70198331
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
or we could probably refer also to this ?
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L21
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #61911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61911/consoleFull)** for PR 14090 at commit [`7781d1c`](https://github.com/apache/spark/commit/7781d1c111f38e3608d5ebd468e6d344d52efa5c).
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70194370
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I see. I think we can describe the following type mapping in the programming guide.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
Those are the types used in the StructType's fields.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70000362
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I suppose this could be explained in `dapply` above as well
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62299/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70922747
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>timestamp</td>
+ <td>timestamp</td>
+</tr>
+<tr>
+ <td>date</td>
+ <td>date</td>
+</tr>
+<tr>
+ <td>array</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>list</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>map</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>env</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>struct</td>
--- End diff --
We can remove map, struct. For timestamp lets replace the R side of the table with `POSIXct` / `POSIXlt`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71041580
--- Diff: docs/sparkr.md ---
@@ -295,8 +294,7 @@ head(collect(df1))
##### dapplyCollect
Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of function
-should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
-output of UDF run on all the partitions can fit in driver memory.
+should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
<div data-lang="r" markdown="1">
--- End diff --
I think we need a new line before the `<div>` ? Right now the `div` markings show up in the generated doc. I've attached a screenshot
![screenshot 2016-07-15 14 11 39](https://cloud.githubusercontent.com/assets/143893/16888609/1d4409fe-4a96-11e6-97db-6ebf05a03774.png)
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r69955401
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
it was hard to do in the roxygen2 doc, but the programming guide would be a great place to touch on or refer to what "match" means exactly - the type mapping between Spark and R is a bit fuzzy and it would be good to explain that a bit more
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks, I've generated the docs the way you suggested, @shivaram, but I'm not sure I see the same thing as you do.
I still see some '{% highlight r %}' and some formatting issues in general. I also followed this documentation:
https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html
Please, let me know if you still see the issues after my latest commit.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62145/consoleFull)** for PR 14090 at commit [`c1d7151`](https://github.com/apache/spark/commit/c1d71512a3bf0205615d1b6318029ad6f33d94dc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #61911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61911/consoleFull)** for PR 14090 at commit [`7781d1c`](https://github.com/apache/spark/commit/7781d1c111f38e3608d5ebd468e6d344d52efa5c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70711218
--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
that key. The groups are chosen from `SparkDataFrame`s column(s).
The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user. Bellow data type mapping between R
--- End diff --
`Bellow` should be `Below`?
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/14090
LGTM. thanks for putting this together!
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62411/consoleFull)** for PR 14090 at commit [`f584416`](https://github.com/apache/spark/commit/f584416b81bc19d951d28eb2861cc3a4a16bc117).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks @NarineK for the updates. As a final thing I just had some formatting problems when I tested out this change locally. Let me know if you can't reproduce them. I just ran
```
cd docs
SKIP_API=1 jekyll build
open _site/sparkr.html
```
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202736
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Thanks, I was looking at the types.R file and noticed that we have NA's for array, map, and struct.
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L42
But I guess in our case we can have array, map, and struct mapped to array, map, and struct correspondingly?
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
cc @felixcheung @mengxr
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/14090
LGTM except for comment on "schema matching".
Also, I wonder if we should rephrase "can only be used if the output of UDF run on all the partitions can fit in driver memory" - it seems neither as strong a warning nor as correct as "can fail if the output of UDF run on all the partitions cannot be pulled to the driver and fit in driver memory" (same in `dapplyCollect`)
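
For reference, a sketch of the `gapplyCollect` variant this warning concerns (it runs the same kind of UDF but pulls the result to the driver as a local R `data.frame`, so no schema is passed; the column names and UDF body are illustrative, reusing the `faithful`-based `df` assumed above):

```r
# Illustrative sketch: like gapply, but the result is collected to the
# driver as a local data.frame, so no output schema is required.
# This can fail if the combined UDF output does not fit in driver memory.
result <- gapplyCollect(
  df,
  "waiting",
  function(key, x) {
    y <- data.frame(key, max(x$eruptions))
    colnames(y) <- c("waiting", "max_eruption")  # names set directly on the data.frame
    y
  })
head(result[order(result$max_eruption, decreasing = TRUE), ])
```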
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71041878
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
--- End diff --
`Below data type` -> `Below is the data type`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202560
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
This looks good to me !
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62145/
Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:
https://github.com/apache/spark/pull/14090
Added data type description
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71047599
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>[POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)</td>
--- End diff --
I think we need to put `<a href >` in `<table>`, eg. https://github.com/apache/spark/blame/master/docs/structured-streaming-programming-guide.md#L811
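Concretely, the suggestion is to use a plain HTML anchor inside the table cell rather than a Markdown link, since Markdown inline syntax is typically not processed inside raw HTML blocks by the site generator. A sketch of the fix:

```html
<tr>
  <td><a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html">POSIXct</a></td>
  <td>timestamp</td>
</tr>
```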
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70923645
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
Sounds good. For the mappings from 'POSIXct / POSIXlt' to 'timestamp' and from 'Date' to 'date', do we need to update the 'getSQLDataType' method?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70923795
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
Not really - as I mentioned the getSQLDatatype looks at the schema - the method which looks at the R objects is in https://github.com/apache/spark/blob/2e4075e2ece9574100c79558cab054485e25c2ee/R/pkg/R/serialize.R#L84
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62369/consoleFull)** for PR 14090 at commit [`19e849f`](https://github.com/apache/spark/commit/19e849f066e970a755401f99bc8248b8258a11c4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71041809
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>[POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)</td>
--- End diff --
Also, not sure why, but the URL formatting doesn't seem to be working here. A screenshot of what I see is below:
![screenshot 2016-07-15 14 13 56](https://cloud.githubusercontent.com/assets/143893/16888670/61fede2a-4a96-11e6-8b7f-507f3eb194d4.png)
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62300/consoleFull)** for PR 14090 at commit [`5d34943`](https://github.com/apache/spark/commit/5d3494337ed2dfc5592b11e324aa7ef52a6f354e).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62411/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70920518
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
@shivaram, I've looked at the following list:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L92
It is called when creating the schema's fields, and it includes map, struct, timestamp, etc.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62300/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70926341
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
yes it should be `Date` not `date`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70921996
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
Thanks for the explanation, @shivaram !
So I'll remove map, struct, and timestamp and leave the rest as is.
Does that sound fine?
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62369/
Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62369/consoleFull)** for PR 14090 at commit [`19e849f`](https://github.com/apache/spark/commit/19e849f066e970a755401f99bc8248b8258a11c4).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62147/consoleFull)** for PR 14090 at commit [`2af7243`](https://github.com/apache/spark/commit/2af724321e0d51aed64c84dd22741a7cc6067caf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70926563
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is applied to each group of the `SparkDataFrame` and should have only two parameters: a grouping key and an R `data.frame` corresponding to
+that key. The groups are chosen from the `SparkDataFrame`'s column(s).
+The output of the function should be a `data.frame`. The schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent the R function's output schema in terms of Spark data types. The column names of the returned `data.frame` are set by the user. The table below shows the data type mapping between R
+and Spark.
+
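The behavior described above can be sketched with a short example, assuming an active SparkR session and using the built-in `faithful` dataset mentioned in the PR description (column and schema names here are illustrative):

```
# Group the faithful dataset by waiting time and compute the maximum
# eruption duration within each group.
df <- createDataFrame(faithful)

schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

result <- gapply(
  df,
  "waiting",                     # grouping column(s)
  function(key, x) {
    # key: the grouping key for this group
    # x:   an R data.frame holding the rows of this group
    data.frame(waiting = key[[1]], max_eruption = max(x$eruptions))
  },
  schema)                        # must match the function's output

head(collect(result))
```

With `gapplyCollect`, the schema argument is omitted and the result is returned directly as a local R `data.frame`.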
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>timestamp</td>
+ <td>timestamp</td>
+</tr>
+<tr>
+ <td>date</td>
+ <td>date</td>
+</tr>
+<tr>
+ <td>array</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>list</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>map</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>env</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>struct</td>
--- End diff --
And `environment` instead of `env`?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html
```
> e <- new.env()
> class(e)
[1] "environment"
```
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202064
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is applied to each group of the `SparkDataFrame` and should have only two parameters: a grouping key and an R `data.frame` corresponding to
+that key. The groups are chosen from the `SparkDataFrame`'s column(s).
+The output of the function should be a `data.frame`. The schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Yeah but instead of a pointer to the code it would be great if we could have a table in the documentation.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62299/consoleFull)** for PR 14090 at commit [`8a2aff3`](https://github.com/apache/spark/commit/8a2aff3add082e20c45136dc5814e6ccdf4b256c).
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62145/consoleFull)** for PR 14090 at commit [`c1d7151`](https://github.com/apache/spark/commit/c1d71512a3bf0205615d1b6318029ad6f33d94dc).
---