Posted to reviews@spark.apache.org by vectorijk <gi...@git.apache.org> on 2016/06/14 10:15:08 UTC

[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

GitHub user vectorijk opened a pull request:

    https://github.com/apache/spark/pull/13660

    [SPARK-15672][R][DOC] R programming guide update

    ## What changes were proposed in this pull request?
    Guide for:
    - UDFs with dapply, dapplyCollect
    - spark.lapply for running parallel R functions
    
    ## How was this patch tested?
    Built locally.
    <img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vectorijk/spark spark-15672-R-guide-update

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13660.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13660
    
----
commit adee4d46551c379cb3c092d603041551f779c630
Author: Kai Jiang <ji...@gmail.com>
Date:   2016-06-06T04:41:42Z

    revise documentation for sparkr

commit 9081a0bf5072da5f5255f7ffcf398758ff19b46c
Author: Kai Jiang <ji...@gmail.com>
Date:   2016-06-06T04:41:42Z

    revise documentation for sparkr

commit 1ba263498c68ec41ec0096cdb305205a8f99f058
Author: Kai Jiang <ji...@gmail.com>
Date:   2016-06-14T09:31:08Z

    Merge branch 'spark-15672-R-guide-update' of github.com:vectorijk/spark into spark-15672-R-guide-update

commit 2611549e60f68d6ba12ec10b471dc96944508873
Author: Kai Jiang <ji...@gmail.com>
Date:   2016-06-14T10:08:25Z

    update

----




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Ping @vectorijk 




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60957 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60957/consoleFull)** for PR 13660 at commit [`8d4f163`](https://github.com/apache/spark/commit/8d4f16354005be12b932287f01b44a1c99a56f5b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67944525
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    --- End diff --
    
    @sun-rui, will do `gapply` in another PR.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    @NarineK That is sort of unrelated to this PR, since this PR is about the programming guide?
    
    But in short, this happens because in the R code both `dapply` and `dapplyCollect` have the `@rdname` tag set to "dapply". I'm not sure if we need to do that. But the first copy of "x ..." and "func ..." is from `dapply` and the second is from `dapplyCollect`.
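    
    For reference, a minimal roxygen2 sketch of how a shared `@rdname` merges documentation (illustrative only, not the actual SparkR source): both generics write into the same `dapply.Rd`, so their `@param` entries are concatenated on one help page.
    
    ```r
    library(methods)
    
    #' Apply a function to each partition of a SparkDataFrame
    #'
    #' @param x A SparkDataFrame
    #' @param func A function to be applied to each partition
    #' @rdname dapply
    setGeneric("dapply", function(x, func, schema) standardGeneric("dapply"))
    
    #' @param x A SparkDataFrame
    #' @param func A function to be applied to each partition
    #' @rdname dapply
    setGeneric("dapplyCollect", function(x, func) standardGeneric("dapplyCollect"))
    ```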





[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by vectorijk <gi...@git.apache.org>.
Github user vectorijk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67945982
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    --- End diff --
    
    @NarineK I filed a JIRA [SPARK-16112](https://issues.apache.org/jira/browse/SPARK-16112) for the `gapply` programming guide so that you can open a PR for it.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by vectorijk <gi...@git.apache.org>.
Github user vectorijk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67782797
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,79 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
    +should be a `data.frame`. Schema specifies the row format of the resulting `SparkDataFrame`. It must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame and return a R's data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    +Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark.
    +Applies a function in a manner that is similar to `doParallel` or `lapply` to elements of a list.
    --- End diff --
    
    Thanks so much for pointing this out! I will update these very soon.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60485/
    Test PASSed.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67772111
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,79 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
    +should be a `data.frame`. Schema specifies the row format of the resulting `SparkDataFrame`. It must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame and return a R's data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    +Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark.
    +Applies a function in a manner that is similar to `doParallel` or `lapply` to elements of a list.
    --- End diff --
    
    Similar to the above, it would be good to add a line here saying that the results of all the computations should fit on a single machine, and that if that is not the case they can do something like `df <- createDataFrame(list)` and then use `dapply`.
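    
    A hedged sketch of that fallback (the column names and values here are made up for illustration): distributing the input as a SparkDataFrame and transforming it with `dapply` keeps the results distributed instead of collecting them onto one machine.
    
    ```r
    # Hypothetical input whose combined results would be too large to collect
    df <- createDataFrame(data.frame(id = 1:1000))
    schema <- structType(structField("id", "integer"),
                         structField("id_doubled", "integer"))
    # The function runs per partition and the output stays distributed
    result <- dapply(df, function(x) cbind(x, x$id * 2L), schema)
    head(result)
    ```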




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67018163
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +
    --- End diff --
    
    It would be good to add an introduction here saying that there are two kinds of user-defined functions we support in SparkR. Something like:
    ```
    In SparkR we support two kinds of user-defined functions
    1. Run a given function on a large dataset using dapply. 
    2. Run many functions in parallel using spark.lapply. 
    ```




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60787/consoleFull)** for PR 13660 at commit [`063bc8e`](https://github.com/apache/spark/commit/063bc8ed69b90504fbd79d32b53e044275bd6908).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60918/consoleFull)** for PR 13660 at commit [`ae26233`](https://github.com/apache/spark/commit/ae26233922c36e747244bd317719b9559a43d38f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67910379
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
    +should be a `data.frame`. Schema specifies the row format of the resulting `SparkDataFrame`. It must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back. The output of function
    +should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
    +output of UDF run on all the partitions can fit in driver memory.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame and return a R's data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    --- End diff --
    
    This discussion probably belongs on JIRA, but I don't think spark.lapply is particularly similar to RDD or Dataset. We are taking a local list here and passing a value from it to each parallel function. I think of Datasets / RDDs as distributed datasets which don't fit in a local R session, and DataFrames in SparkR already cover that.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67880117
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    --- End diff --
    
    several kinds of?




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67029622
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +
    +#### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function should be a `data.frame`.
    --- End diff --
    
    perhaps explain why the schema needs to be passed here?
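    
    For context, one shape such an explanation could take (the schema construction below appears in a later revision of this diff): the schema declares the column names and types of the `data.frame` the UDF returns, since Spark cannot infer them from arbitrary R code.
    
    ```r
    schema <- structType(structField("eruptions", "double"),
                         structField("waiting", "double"),
                         structField("waiting_secs", "double"))
    # Spark uses this schema as the row format of the resulting SparkDataFrame
    df1 <- dapply(df, function(x) cbind(x, x$waiting * 60), schema)
    ```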




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by vectorijk <gi...@git.apache.org>.
Github user vectorijk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67020817
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +
    +#### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function should be a `data.frame`.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +
    +df1 <- dapply(df, function(x) {x}, schema(df))
    --- End diff --
    
    ok, I will improve this example more specifically.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67029962
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +
    +#### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function should be a `data.frame`.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +
    +df1 <- dapply(df, function(x) {x}, schema(df))
    +head(collect(df1), 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(df, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### lapply
    +Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark.
    +Applies a function in a manner that is similar to `doParallel` or `lapply` to elements of a list.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Perform distributed training of multiple models with spark.lapply
    +families <- c("gaussian", "poisson")
    +train <- function(family) {
    +  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
    +  summary(model)
    +}
    +model.summaries <- spark.lapply(sc, families, train)
    --- End diff --
    
    perhaps describe more on what will get passed to this udf `train` here?
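    
    For illustration, a hedged sketch of what that description could say, mirroring the diff above (the exact `spark.lapply` signature was still settling at this point): each element of `families` is shipped to an executor and passed as the single argument of `train`, and the results come back as a local list.
    
    ```r
    families <- c("gaussian", "poisson")
    train <- function(family) {
      # `family` is one element of `families`, received on the executor
      model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
      summary(model)
    }
    model.summaries <- spark.lapply(sc, families, train)
    length(model.summaries)  # 2: one summary per element of `families`
    ```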




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Yeah, we can remove the duplication by having separate Rd files or by just removing documentation for the overlapping arguments (I think in this case `x` and `func` are the same for `dapply` and `dapplyCollect`).
    
    @NarineK feel free to open a separate JIRA/PR for this
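    
    A possible shape of that fix, sketched with roxygen2 (illustrative, not the actual change): document the shared arguments once on `dapply` and let `dapplyCollect` contribute only the shared `@rdname`.
    
    ```r
    #' @param x A SparkDataFrame
    #' @param func A function to be applied to each partition
    #' @rdname dapply
    setGeneric("dapply", function(x, func, schema) standardGeneric("dapply"))
    
    #' @rdname dapply
    setGeneric("dapplyCollect", function(x, func) standardGeneric("dapplyCollect"))
    ```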




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67030111
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +
    +#### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function should be a `data.frame`.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +
    +df1 <- dapply(df, function(x) {x}, schema(df))
    +head(collect(df1), 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(df, 3)
    --- End diff --
    
    it will also be useful to point out `ldf` is a `data.frame`?
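    
    A tiny hypothetical session note could make that explicit: unlike `dapply`, which returns a distributed `SparkDataFrame`, `dapplyCollect` hands back a local R `data.frame`, so base R functions apply to it directly.
    
    ```r
    class(ldf)  # "data.frame": a local object on the driver, not a SparkDataFrame
    nrow(ldf)   # plain base R functions work on it directly
    ```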




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Build finished. Test FAILed.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60485/consoleFull)** for PR 13660 at commit [`2611549`](https://github.com/apache/spark/commit/2611549e60f68d6ba12ec10b471dc96944508873).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67881287
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
    +should be a `data.frame`. Schema specifies the row format of the resulting `SparkDataFrame`. It must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back. The output of function
    +should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
    +output of UDF run on all the partitions can fit in driver memory.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame and return a R's data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    --- End diff --
    
    One thought about spark.lapply() is that documenting it here means our commitment to it. This is a case demonstrating the need to support Dataset in SparkR. Maybe as a next step we can consider replacing RDD with Dataset in SparkR.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60918/
    Test PASSed.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13660




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67880209
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    --- End diff --
    
    Apply a function to each partition of a `SparkDataFrame`.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67880749
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
    +should be a `data.frame`. Schema specifies the row format of the resulting `SparkDataFrame`. It must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back. The output of function
    +should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
    +output of UDF run on all the partitions can fit in driver memory.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame and return a R's data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    --- End diff --
    
    style nit: "waiting_secs" = x$waiting * 60




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60863/consoleFull)** for PR 13660 at commit [`3f2aea9`](https://github.com/apache/spark/commit/3f2aea9e6c909ec271e146073865b13551d90c27).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60485/consoleFull)** for PR 13660 at commit [`2611549`](https://github.com/apache/spark/commit/2611549e60f68d6ba12ec10b471dc96944508873).




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60788/
    Test PASSed.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60957/consoleFull)** for PR 13660 at commit [`8d4f163`](https://github.com/apache/spark/commit/8d4f16354005be12b932287f01b44a1c99a56f5b).




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60788/consoleFull)** for PR 13660 at commit [`920c975`](https://github.com/apache/spark/commit/920c975adf176a1cbce3f3631762073a7131f713).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60863/
    Test PASSed.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67915708
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Function
    +In SparkR, we support several kinds for User-defined Functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function
    +should be a `data.frame`. Schema specifies the row format of the resulting `SparkDataFrame`. It must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back. The output of function
    +should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the
    +output of UDF run on all the partitions can fit in driver memory.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from hours to seconds.
    +# Note that we can apply UDF to DataFrame and return a R's data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    --- End diff --
    
    I think spark.lapply is more about running native R code distributed, whereas Dataset would be typed data distributed (without native R); that seems fairly orthogonal to me.
    
    Speaking of which, how should we address local R here? Should this say "Run local R functions in parallel using spark.lapply" or "Run local R functions distributed using spark.lapply"?




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Hi @vectorijk , @felixcheung ,
    As I was looking at the documentation generated in R, I noticed that there is some duplicated information. I'm not sure if this is the right place to ask about it, but I thought you might have seen it.
    In the R help I see the following:
    ```
    Arguments
    
    x	
    A SparkDataFrame
    func	
    A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.
    schema	
    The schema of the resulting SparkDataFrame after the function is applied. It must match the output of func.
    x	
    A SparkDataFrame
    func	
    A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.
    See Also
    
    Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text
    
    Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text
    
    ```
    
    Is this on purpose?
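    
    (If it helps, here is a hypothetical roxygen sketch of how this kind of duplication typically arises — the generics below are simplified stand-ins, not the actual SparkR source. When two methods share one `@rdname`, roxygen concatenates their `@param` and `@family` tags into a single Rd file, producing repeated Arguments and See Also entries:)
    
    ```r
    #' @rdname dapply
    #' @param x A SparkDataFrame
    #' @family SparkDataFrame functions
    setGeneric("dapply", function(x, func, schema) { standardGeneric("dapply") })
    
    #' @rdname dapply
    #' @param x A SparkDataFrame
    #' @family SparkDataFrame functions
    setGeneric("dapplyCollect", function(x, func) { standardGeneric("dapplyCollect") })
    ```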





[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60957/
    Test PASSed.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67018219
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Functions
    +
    +#### dapply
    +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function should be a `data.frame`.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +
    +df1 <- dapply(df, function(x) {x}, schema(df))
    --- End diff --
    
    The conversion is not actually happening in this example?
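    
    (Concretely, a fixed example might look like the sketch below, which mirrors the revision quoted later in this thread — an explicit output schema plus the actual conversion inside the UDF:)
    
    ```r
    # Declare the output row format so Spark can build the resulting SparkDataFrame.
    schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
                         structField("waiting_secs", "double"))
    # Do the conversion inside the function instead of returning x unchanged.
    df1 <- dapply(df, function(x) { cbind(x, x$waiting * 60) }, schema)
    head(collect(df1), 3)
    ```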




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60787/
    Test FAILed.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67920626
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Functions
    +In SparkR, we support several kinds of user-defined functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
    +should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`; it must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the
    +function should be a `data.frame`, but the schema does not need to be passed. Note that `dapplyCollect` can be used only if the
    +output of the UDF run on all the partitions can fit in driver memory.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame and return an R data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    --- End diff --
    
    I'm a little more inclined to use `distributed` instead of `parallel`, but both of them sound fine to me.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    @felixcheung @jkbradley any more comments on this ?




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Thanks @vectorijk - I left some comments inline.
    
    cc @felixcheung 




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Can you add documentation for gapply() and gapplyCollect() here as well, or will @NarineK do that in another PR?
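    
    (For reference, a hypothetical sketch of what a gapply example could look like — the signature shown follows the dapply pattern with a grouping column and an output schema, and is illustrative rather than taken from this PR:)
    
    ```r
    # Compute the maximum eruption time per waiting value; the UDF runs once per group.
    schema <- structType(structField("waiting", "double"),
                         structField("max_eruption", "double"))
    result <- gapply(
        df,
        "waiting",
        function(key, x) {
          # key holds the grouping value(s); x is the group's rows as a data.frame.
          data.frame(waiting = key[[1]], max_eruption = max(x$eruptions))
        },
        schema)
    head(collect(result))
    ```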





[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60788/consoleFull)** for PR 13660 at commit [`920c975`](https://github.com/apache/spark/commit/920c975adf176a1cbce3f3631762073a7131f713).




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67771773
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,79 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Functions
    +In SparkR, we support several kinds of user-defined functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
    +should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`; it must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back.
    --- End diff --
    
    I think it's good to say a couple of things here. First, that we don't require any schema to be passed in to `dapplyCollect` (unlike `dapply`). The other thing is that it's good to remind users that this should be used only if the output of the UDF run on all the partitions can fit in driver memory.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67018458
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,67 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Functions
    +
    +#### dapply
    +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function should be a `data.frame`.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +
    +df1 <- dapply(df, function(x) {x}, schema(df))
    +head(collect(df1), 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### dapplyCollect
    +Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(df, 3)
    --- End diff --
    
    Would be good to show how `head(ldf)` looks here. Also, we should note that the difference in `dapplyCollect` is that the schema doesn't need to be passed in by the user.
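    
    (For instance, based on the revised example that appears elsewhere in this review, the collected local data.frame could be shown as:)
    
    ```r
    head(ldf, 3)
    ##  eruptions waiting waiting_secs
    ##1     3.600      79         4740
    ##2     1.800      54         3240
    ##3     3.333      74         4440
    ```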




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by vectorijk <gi...@git.apache.org>.
Github user vectorijk commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Jenkins test this again.




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by vectorijk <gi...@git.apache.org>.
Github user vectorijk commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    cc @jkbradley @shivaram 




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    LGTM.  I'll merge this with master and branch-2.0
    Thanks!




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by vectorijk <gi...@git.apache.org>.
Github user vectorijk commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    @jkbradley @shivaram @felixcheung addressed comments.




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67880530
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Functions
    +In SparkR, we support several kinds of user-defined functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
    +should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`; it must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of `SparkDataFrame` and collect the result back. The output of function
    --- End diff --
    
    apply a function to each partition of a `SparkDataFrame`




[GitHub] spark pull request #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13660#discussion_r67880977
  
    --- Diff: docs/sparkr.md ---
    @@ -262,6 +262,83 @@ head(df)
     {% endhighlight %}
     </div>
     
    +### Applying User-defined Functions
    +In SparkR, we support several kinds of user-defined functions:
    +
    +#### Run a given function on a large dataset using `dapply` or `dapplyCollect`
    +
    +##### dapply
    +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame`
    +should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
    +should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame`; it must match the R function's output.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame.
    +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
    +                     structField("waiting_secs", "double"))
    +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
    +head(collect(df1))
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +##4     2.283      62         3720
    +##5     4.533      85         5100
    +##6     2.883      55         3300
    +{% endhighlight %}
    +</div>
    +
    +##### dapplyCollect
    +Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the
    +function should be a `data.frame`, but the schema does not need to be passed. Note that `dapplyCollect` can be used only if the
    +output of the UDF run on all the partitions can fit in driver memory.
    +<div data-lang="r"  markdown="1">
    +{% highlight r %}
    +
    +# Convert waiting time from minutes to seconds.
    +# Note that we can apply a UDF to a DataFrame and return an R data.frame
    +ldf <- dapplyCollect(
    +         df,
    +         function(x) {
    +           x <- cbind(x, "waiting_secs"=x$waiting * 60)
    +         })
    +head(ldf, 3)
    +##  eruptions waiting waiting_secs
    +##1     3.600      79         4740
    +##2     1.800      54         3240
    +##3     3.333      74         4440
    +
    +{% endhighlight %}
    +</div>
    +
    +#### Run many functions in parallel using `spark.lapply`
    +
    +##### lapply
    --- End diff --
    
    spark.lapply




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60787/consoleFull)** for PR 13660 at commit [`063bc8e`](https://github.com/apache/spark/commit/063bc8ed69b90504fbd79d32b53e044275bd6908).




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60918/consoleFull)** for PR 13660 at commit [`ae26233`](https://github.com/apache/spark/commit/ae26233922c36e747244bd317719b9559a43d38f).




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    Great! Please see the pending PR #13752 on removing the `sc` parameter from `spark.lapply`.
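    
    (Assuming that PR lands as described, the call-site change would look roughly like this sketch:)
    
    ```r
    # Before #13752: the SparkContext handle is passed explicitly.
    results <- spark.lapply(sc, seq(4), function(i) { i + 1 })
    
    # After #13752: the active Spark session is picked up implicitly.
    results <- spark.lapply(seq(4), function(i) { i + 1 })
    ```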




[GitHub] spark issue #13660: [SPARK-15672][R][DOC] R programming guide update

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13660
  
    **[Test build #60863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60863/consoleFull)** for PR 13660 at commit [`3f2aea9`](https://github.com/apache/spark/commit/3f2aea9e6c909ec271e146073865b13551d90c27).

