Posted to reviews@spark.apache.org by NarineK <gi...@git.apache.org> on 2016/07/07 13:31:24 UTC
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
GitHub user NarineK opened a pull request:
https://github.com/apache/spark/pull/14090
[SPARK-16112][SparkR] Programming guide for gapply/gapplyCollect
## What changes were proposed in this pull request?
Updates programming guide for spark.gapply/spark.gapplyCollect.
Similar to the other examples, I used the `faithful` dataset to demonstrate gapply's functionality.
Please let me know if you prefer another example.
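For context, a minimal sketch of the kind of `faithful`-based `gapply` example the guide adds (a sketch only, assuming a running SparkR session started with `sparkR.session()`; the `max_eruption` column name is illustrative, not taken from the patch):

```r
# Sketch: compute the maximum eruption duration per waiting time
# using gapply over the built-in faithful dataset.
df <- createDataFrame(faithful)

# Schema of the R function's output: grouping key plus the aggregate.
schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

result <- gapply(
  df,
  "waiting",                       # grouping column(s)
  function(key, x) {
    # key: the grouping key; x: a local data.frame for this group
    data.frame(key, max(x$eruptions))
  },
  schema)

head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
```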
## How was this patch tested?
Existing test cases in R
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/NarineK/spark gapplyProgGuide
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14090.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14090
----
commit 29d8a5c6c22202cdf7d6cc44f1d6cbeca5946918
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-06-20T22:12:11Z
Fixed duplicated documentation problem + separated documentation for dapply and dapplyCollect
commit 698c4331d2a8bfe7f4b372ebc8123b6c27a57e68
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-06-23T18:51:48Z
merge with master
commit 85a4493a03b3601a93c25ebc1eafb2868efec8d8
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-07-07T13:18:49Z
Adding programming guide for gapply/gapplyCollect
commit 7781d1c111f38e3608d5ebd468e6d344d52efa5c
Author: Narine Kokhlikyan <na...@slice.com>
Date: 2016-07-07T13:27:35Z
removing output format
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62147/consoleFull)** for PR 14090 at commit [`2af7243`](https://github.com/apache/spark/commit/2af724321e0d51aed64c84dd22741a7cc6067caf).
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Build finished. Test PASSed.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202321
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Thanks @shivaram.
Does the following mapping look fine to have in the table?
```
**R Spark**
byte byte
integer integer
float float
double double
numeric double
character string
string string
binary binary
raw binary
logical boolean
timestamp timestamp
date date
array array
map map
struct struct
```
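For what it's worth, the right-hand (Spark) names in such a table are the type strings that `structField` accepts when declaring a UDF's output schema, e.g. (a sketch; requires the SparkR package loaded, and the field names are hypothetical):

```r
# Sketch: Spark-side type names used as structField type strings.
schema <- structType(
  structField("id",    "integer"),
  structField("score", "double"),
  structField("name",  "string"),
  structField("flag",  "boolean"))
```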
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
Merging this to master, branch-2.0
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70922863
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>timestamp</td>
+ <td>timestamp</td>
+</tr>
+<tr>
+ <td>date</td>
+ <td>date</td>
+</tr>
+<tr>
+ <td>array</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>list</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>map</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>env</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>struct</td>
--- End diff --
And as you mentioned above we can also change `date` to `Date` to be more specific. (It would be ideal, now that I think of it, to link these R types to the CRAN help pages. For example we can link to https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html for `Date` and https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html for `POSIXct / POSIXlt`.)
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks @NarineK - I tried it on a fresh Ubuntu VM and the rendered docs looked fine there. I think the earlier rendering issue has something to do with ruby / jekyll versions.
LGTM. @felixcheung Could you also take one final look ?
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70711116
--- Diff: docs/sparkr.md ---
@@ -263,7 +263,7 @@ In SparkR, we support several kinds of User-Defined Functions:
##### dapply
Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
-should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match the R function's output.
+should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match to [data types of R function's output fields](#data-type-mapping-between-r-and-spark).
--- End diff --
`output fields` --> `return values` or `return value`?
http://adv-r.had.co.nz/Functions.html#return-values
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70168781
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Thanks @felixcheung, does this sound better?
"It must reflect R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user." I could also bring up some examples.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70920785
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
That's a good point. So users can create a schema with `struct`, and that maps to a corresponding SQL type, but they can't create any R objects that will be parsed as `struct`. The main reason our schema is more flexible than our serialization / deserialization support is that the schema can be used to, say, read JSON files or JDBC tables etc.
For the use case here, where users are returning a `data.frame` from a UDF, I don't think there is any valid mapping for `struct` from R.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70346974
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I think those mappings are only used to print things in `str`. A better list to consult would be the one at https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L23 -- as that says, `list` in R should become an `array` in SparkSQL, and `env` in R should map to a `map`.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70905195
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
I don't think `date` is a type either.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70920244
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
@felixcheung, I think according to the following mapping we expect 'date':
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
And it seems that there is a `Date` class in base R. Do I understand correctly?
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62411/consoleFull)** for PR 14090 at commit [`f584416`](https://github.com/apache/spark/commit/f584416b81bc19d951d28eb2861cc3a4a16bc117).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62300/consoleFull)** for PR 14090 at commit [`5d34943`](https://github.com/apache/spark/commit/5d3494337ed2dfc5592b11e324aa7ef52a6f354e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/14090
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks @shivaram, @felixcheung for the comments. I'll address those today.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
@felixcheung Could you take one more look at this ?
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70846132
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
+<tr>
+ <td>struct</td>
--- End diff --
I don't think R has any notion of a `struct` or `map` data type. Looking at the list of R data structures at http://adv-r.had.co.nz/Data-structures.html I think we should remove the struct -> struct and map -> map entries. Also, I don't think there is a `timestamp` class in R; we should probably replace that with `POSIXct` or `POSIXlt`.
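A quick base-R check supports this (a sketch run in plain R; it shows the class names R actually reports, which differ from several R-side entries in the proposed table):

```r
# What base R actually reports for the classes discussed above:
class(as.raw(1))   # "raw"               -> Spark binary
class(TRUE)        # "logical"           -> Spark boolean
class(Sys.time())  # "POSIXct" "POSIXt"  -> Spark timestamp
class(Sys.Date())  # "Date"              -> Spark date
class(list(1, 2))  # "list"              -> Spark array
class(new.env())   # "environment"       -> Spark map
```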
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62147/
Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62299/consoleFull)** for PR 14090 at commit [`8a2aff3`](https://github.com/apache/spark/commit/8a2aff3add082e20c45136dc5814e6ccdf4b256c).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70172206
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I think gapply and dapply are the first important use cases where we require a strict mapping of Spark JVM types to R atomic types. It might be worthwhile to add a section in the programming guide to illustrate and explain that further.
To be more concrete, what should be the column type of the UDF output R data.frame if the SparkDataFrame has a column of double? It would be good to have a table on that.
That could be a separate PR though.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61911/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70711263
--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
that key. The groups are chosen from `SparkDataFrame`s column(s).
The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user. Bellow data type mapping between R
--- End diff --
same, `output field` here
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70198331
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
or we could probably refer also to this ?
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L21
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #61911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61911/consoleFull)** for PR 14090 at commit [`7781d1c`](https://github.com/apache/spark/commit/7781d1c111f38e3608d5ebd468e6d344d52efa5c).
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70194370
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I see. I think we can describe the following type mapping in the programming guide.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
Those are the types used in the StructType's fields.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70000362
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
I suppose this could be explained in `dapply` above as well
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62299/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70922747
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>timestamp</td>
+ <td>timestamp</td>
+</tr>
+<tr>
+ <td>date</td>
+ <td>date</td>
+</tr>
+<tr>
+ <td>array</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>list</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>map</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>env</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>struct</td>
--- End diff --
We can remove map, struct. For timestamp lets replace the R side of the table with `POSIXct` / `POSIXlt`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71041580
--- Diff: docs/sparkr.md ---
@@ -295,8 +294,7 @@ head(collect(df1))
##### dapplyCollect
Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of function
-should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
-output of UDF run on all the partitions can fit in driver memory.
+should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
<div data-lang="r" markdown="1">
--- End diff --
I think we need a new line before the `<div>` ? Right now the `div` markings show up in the generated doc. I've attached a screenshot
![screenshot 2016-07-15 14 11 39](https://cloud.githubusercontent.com/assets/143893/16888609/1d4409fe-4a96-11e6-97db-6ebf05a03774.png)
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r69955401
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
it was hard to do in the roxygen2 doc, but the programming guide would be a great place to touch on or refer to what "match" means exactly - the type mapping between Spark and R is a bit fuzzy and it would be good to explain that a bit more
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks, I've generated the docs the way you suggested, @shivaram, but I'm not sure I see the same thing as you do.
I still see some '{% highlight r %}' and some formatting issues in general. I also followed this documentation:
https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html
Please, let me know if you still see the issues after my latest commit.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62145/consoleFull)** for PR 14090 at commit [`c1d7151`](https://github.com/apache/spark/commit/c1d71512a3bf0205615d1b6318029ad6f33d94dc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #61911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61911/consoleFull)** for PR 14090 at commit [`7781d1c`](https://github.com/apache/spark/commit/7781d1c111f38e3608d5ebd468e6d344d52efa5c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70711218
--- Diff: docs/sparkr.md ---
@@ -312,7 +310,82 @@ head(ldf, 3)
Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
that key. The groups are chosen from `SparkDataFrame`s column(s).
The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must match the R function's output.
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of each output field in the schema are set by user. Bellow data type mapping between R
--- End diff --
`Bellow` should be `Below`?
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/14090
LGTM. thanks for putting this together!
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62411/consoleFull)** for PR 14090 at commit [`f584416`](https://github.com/apache/spark/commit/f584416b81bc19d951d28eb2861cc3a4a16bc117).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
Thanks @NarineK for the updates. As a final thing I just had some formatting problems when I tested out this change locally. Let me know if you can't reproduce them. I just ran
```
cd docs
SKIP_API=1 jekyll build
open _site/sparkr.html
```
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202736
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Thanks, I was looking at the types.R file and noticed that we have NA's for array, map, and struct.
https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L42
But I guess in our case we can have array, map, and struct mapped to array, map, and struct correspondingly?
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/14090
cc @felixcheung @mengxr
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/14090
LGTM except for comment on "schema matching".
Also, I wonder if we should rephrase "can only be used if the output of UDF run on all the partitions can fit in driver memory" - it seems neither as strong a warning nor as correct as "can fail if the output of UDF run on all the partitions cannot be pulled to the driver and fit in driver memory" (same in `dapplyCollect`)
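
For reference, a sketch of the `gapplyCollect` variant this warning concerns (it runs the same kind of UDF but pulls the result to the driver as a local R `data.frame`, so no schema is passed; the column names and UDF body are illustrative, reusing the `faithful`-based `df` assumed above):

```r
# Illustrative sketch: like gapply, but the result is collected to the
# driver as a local data.frame, so no output schema is required.
# This can fail if the combined UDF output does not fit in driver memory.
result <- gapplyCollect(
  df,
  "waiting",
  function(key, x) {
    y <- data.frame(key, max(x$eruptions))
    colnames(y) <- c("waiting", "max_eruption")  # names set directly on the data.frame
    y
  })
head(result[order(result$max_eruption, decreasing = TRUE), ])
```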
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71041878
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
--- End diff --
`Below data type` -> `Below is the data type`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202560
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
This looks good to me !
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62145/
Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:
https://github.com/apache/spark/pull/14090
Added data type description
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71047599
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>[POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)</td>
--- End diff --
I think we need to put `<a href >` in `<table>`, eg. https://github.com/apache/spark/blame/master/docs/structured-streaming-programming-guide.md#L811
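Concretely, the suggestion is to use a plain HTML anchor inside the table cell rather than a Markdown link, since Markdown inline syntax is typically not processed inside raw HTML blocks by the site generator. A sketch of the fix:

```html
<tr>
  <td><a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html">POSIXct</a></td>
  <td>timestamp</td>
</tr>
```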
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Merged build finished. Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70923645
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
Sounds good. For the mappings from 'POSIXct / POSIXlt' to 'timestamp' and from 'Date' to 'date', do we need to update the 'getSQLDataType' method?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L91
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70923795
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
Not really - as I mentioned the getSQLDatatype looks at the schema - the method which looks at the R objects is in https://github.com/apache/spark/blob/2e4075e2ece9574100c79558cab054485e25c2ee/R/pkg/R/serialize.R#L84
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62369/consoleFull)** for PR 14090 at commit [`19e849f`](https://github.com/apache/spark/commit/19e849f066e970a755401f99bc8248b8258a11c4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r71041809
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,135 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>[POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html)</td>
--- End diff --
Also, not sure why, but the URL formatting doesn't seem to be working here. A screenshot of what I see is below:
![screenshot 2016-07-15 14 13 56](https://cloud.githubusercontent.com/assets/143893/16888670/61fede2a-4a96-11e6-8b7f-507f3eb194d4.png)
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62300/consoleFull)** for PR 14090 at commit [`5d34943`](https://github.com/apache/spark/commit/5d3494337ed2dfc5592b11e324aa7ef52a6f354e).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62411/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70920518
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
@shivaram, I've looked at the following list:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L92
It is called when creating the schema's fields, and it includes map, struct, timestamp, etc.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62300/
Test PASSed.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70926341
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
yes it should be `Date` not `date`
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70921996
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
+that key. The groups are chosen from `SparkDataFrame`s column(s).
+The output of function should be a `data.frame`. Schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below data type mapping between R
+and Spark.
+
+#### Data type mapping between R and Spark
+<table class="table">
+[... R-to-Spark data type mapping rows, identical to the table quoted earlier in this thread ...]
+<tr>
+ <td>struct</td>
--- End diff --
Thanks for the explanation, @shivaram !
So I'll remove map, struct, and timestamp and leave the rest as is.
Does that sound fine?
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14090
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62369/
Test PASSed.
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62369/consoleFull)** for PR 14090 at commit [`19e849f`](https://github.com/apache/spark/commit/19e849f066e970a755401f99bc8248b8258a11c4).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62147/consoleFull)** for PR 14090 at commit [`2af7243`](https://github.com/apache/spark/commit/2af724321e0d51aed64c84dd22741a7cc6067caf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70926563
--- Diff: docs/sparkr.md ---
@@ -316,6 +314,139 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) and using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is applied to each group of the `SparkDataFrame` and should have only two parameters: a grouping key and an R `data.frame` corresponding to
+that key. The groups are chosen from the `SparkDataFrame`'s column(s).
+The output of the function should be a `data.frame`. The schema specifies the row format of the resulting
+`SparkDataFrame`. It must represent the R function's output schema in terms of Spark data types. The column names of the returned `data.frame` are set by the user. The table below shows the data type mapping between R
+and Spark.
+
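The behavior described above can be sketched with a short example, assuming an active SparkR session and using the built-in `faithful` dataset mentioned in the PR description (column and schema names here are illustrative):

```
# Group the faithful dataset by waiting time and compute the maximum
# eruption duration within each group.
df <- createDataFrame(faithful)

schema <- structType(structField("waiting", "double"),
                     structField("max_eruption", "double"))

result <- gapply(
  df,
  "waiting",                     # grouping column(s)
  function(key, x) {
    # key: the grouping key for this group
    # x:   an R data.frame holding the rows of this group
    data.frame(waiting = key[[1]], max_eruption = max(x$eruptions))
  },
  schema)                        # must match the function's output

head(collect(result))
```

With `gapplyCollect`, the schema argument is omitted and the result is returned directly as a local R `data.frame`.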
+#### Data type mapping between R and Spark
+<table class="table">
+<tr><th>R</th><th>Spark</th></tr>
+<tr>
+ <td>byte</td>
+ <td>byte</td>
+</tr>
+<tr>
+ <td>integer</td>
+ <td>integer</td>
+</tr>
+<tr>
+ <td>float</td>
+ <td>float</td>
+</tr>
+<tr>
+ <td>double</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>numeric</td>
+ <td>double</td>
+</tr>
+<tr>
+ <td>character</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>string</td>
+ <td>string</td>
+</tr>
+<tr>
+ <td>binary</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>raw</td>
+ <td>binary</td>
+</tr>
+<tr>
+ <td>logical</td>
+ <td>boolean</td>
+</tr>
+<tr>
+ <td>timestamp</td>
+ <td>timestamp</td>
+</tr>
+<tr>
+ <td>date</td>
+ <td>date</td>
+</tr>
+<tr>
+ <td>array</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>list</td>
+ <td>array</td>
+</tr>
+<tr>
+ <td>map</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>env</td>
+ <td>map</td>
+</tr>
+<tr>
+ <td>struct</td>
--- End diff --
And `environment` instead of `env`?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/environment.html
```
> e <- new.env()
> class(e)
[1] "environment"
```
---
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/14090#discussion_r70202064
--- Diff: docs/sparkr.md ---
@@ -306,6 +306,64 @@ head(ldf, 3)
{% endhighlight %}
</div>
+#### Run a given function on a large dataset grouping by input column(s) using `gapply` or `gapplyCollect`
+
+##### gapply
+Apply a function to each group of a `SparkDataFrame`. The function is applied to each group of the `SparkDataFrame` and should have only two parameters: a grouping key and an R `data.frame` corresponding to
+that key. The groups are chosen from the `SparkDataFrame`'s column(s).
+The output of the function should be a `data.frame`. The schema specifies the row format of the resulting
+`SparkDataFrame`. It must match the R function's output.
--- End diff --
Yeah but instead of a pointer to the code it would be great if we could have a table in the documentation.
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62299/consoleFull)** for PR 14090 at commit [`8a2aff3`](https://github.com/apache/spark/commit/8a2aff3add082e20c45136dc5814e6ccdf4b256c).
---
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14090
**[Test build #62145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62145/consoleFull)** for PR 14090 at commit [`c1d7151`](https://github.com/apache/spark/commit/c1d71512a3bf0205615d1b6318029ad6f33d94dc).
---