Posted to reviews@spark.apache.org by NarineK <gi...@git.apache.org> on 2016/06/18 09:06:34 UTC

[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

GitHub user NarineK opened a pull request:

    https://github.com/apache/spark/pull/13760

    [SPARK-16012][SparkR] GapplyCollect - applies a R function to each group similar to gapply and collects the result back to R data.frame

    ## What changes were proposed in this pull request?
    gapplyCollect() runs gapply() on a SparkDataFrame and collects the result back to R as a data.frame. Compared to gapply() + collect(), gapplyCollect() offers both a performance optimization and programming convenience, since no output schema needs to be provided.
    
    This is similar to dapplyCollect().
    
    ## How was this patch tested?
    Added test cases for gapplyCollect, similar to those for dapplyCollect.
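
    For readers following the thread, a minimal usage sketch contrasting the two approaches (adapted from the examples in this PR; it assumes an active SparkR session, so it is illustrative rather than standalone-runnable):

    ```r
    # Illustrative sketch (needs a running SparkR session). Shows why
    # gapplyCollect() is more convenient: no output schema is required.
    df <- createDataFrame(
      list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
      c("a", "b", "c", "d"))

    # gapply() + collect(): the output schema must be declared up front.
    schema <- structType(structField("a", "integer"), structField("c", "string"),
                         structField("avg", "double"))
    sdf <- gapply(df, list("a", "c"),
                  function(key, x) data.frame(key, mean(x$b), stringsAsFactors = FALSE),
                  schema)
    result1 <- collect(sdf)

    # gapplyCollect(): same computation, no schema; returns a local data.frame.
    result2 <- gapplyCollect(df, list("a", "c"),
                 function(key, x) {
                   y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
                   colnames(y) <- c("a", "c", "avg")
                   y
                 })
    ```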


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NarineK/spark gapplyCollect

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13760.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13760
    
----
commit ea31820c9501d1f8cba96bc7f8e0fab04e9af0a2
Author: Narine Kokhlikyan <na...@slice.com>
Date:   2016-06-17T10:51:15Z

    initial version of gapplyCollect

commit f8e54dc265ad0eb66a26508bc5221606a9652e22
Author: Narine Kokhlikyan <na...@slice.com>
Date:   2016-06-17T11:00:05Z

    merged with master

commit 591c4804764cdce67d22ce52ec38c74f246e738b
Author: Narine Kokhlikyan <na...@slice.com>
Date:   2016-06-17T11:03:04Z

    revert .gitignore

commit 37b633afdff46374d983d204f13c05c769c7f40e
Author: Narine Kokhlikyan <na...@slice.com>
Date:   2016-06-18T08:59:05Z

    added test cases + improved the code

commit de5dbb0be0a3fcc42096a10470c543eaf7aa6d5c
Author: Narine Kokhlikyan <na...@slice.com>
Date:   2016-06-18T09:05:57Z

    fixed test case

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68564302
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
     #' @note gapply(GroupedData) since 2.0.0
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            if (is.null(schema)) stop("schema cannot be NULL")
    +            gapplyInternal(x, func, schema)
               })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    --- End diff --
    
    Well, the descriptions are slightly different. Currently it shows both:
    
    Applies a R function to each group in the input GroupedData and collects the result back to R as a data.frame.
    
    Groups the SparkDataFrame using the specified columns, applies the R function to each group and collects the result back to R as data.frame.
    
    I can remove the one in group.R and leave only the DataFrame one.





[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61064/
    Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67799929
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1347,6 +1347,65 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @examples
    +#'
    +#' \dontrun{
    +#' Computes the arithmetic mean of the second column by grouping
    +#' on the first and third columns. Output the grouping values and the average.
    +#'
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list("a", "c"),
    +#'   function(key, x) {
    +#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    +#'     colnames(y) <- c("key_a", "key_c", "mean_b")
    +#'     y
    +#'   })
    +#'
    +#' Result
    +#' ------
    +#' key_a key_c mean_b
    +#' 3 3 3.0
    +#' 1 1 1.5
    +#'
    +#' Fits linear models on iris dataset by grouping on the 'Species' column and
    +#' using 'Sepal_Length' as a target variable, 'Sepal_Width', 'Petal_Length'
    +#' and 'Petal_Width' as training features.
    +#'
    +#' df <- createDataFrame (iris)
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list(df$"Species"),
    --- End diff --
    
    No need for a scalar to be a list; just df$"Species" is OK.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60778/
    Test FAILed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60779/
    Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Do you have any questions on this, @shivaram, @sun-rui?




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68268602
  
    --- Diff: R/pkg/R/group.R ---
    @@ -199,17 +199,10 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
     #' @export
    +#' @seealso \link{gapplyCollect}
     #' @examples
     #' \dontrun{
    --- End diff --
    
    I think it's fine either way (to have the examples spread vs. not spread). BTW, the snippet pasted above looks pretty good to me.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67777816
  
    --- Diff: R/pkg/R/group.R ---
    @@ -191,18 +184,72 @@ createMethods()
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            gapplyInternal(x, func, schema)
    +          })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    +#' back to R as a data.frame.
    +#'
    +#' @param x a GroupedData
    +#' @param func A function to be applied to each group partition specified by GroupedData.
    +#'             The function `func` takes as argument a key - grouping columns and
    +#'             a data frame - a local R data.frame.
    +#'             The output of `func` is a local R data.frame.
    +#' @return a SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @seealso gapply \link{gapply}
    --- End diff --
    
    same here?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60815/consoleFull)** for PR 13760 at commit [`7dd883f`](https://github.com/apache/spark/commit/7dd883f070d3a620a5b45765ccbe6b5876baa52f).




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68542089
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
     #' @note gapply(GroupedData) since 2.0.0
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            if (is.null(schema)) stop("schema cannot be NULL")
    +            gapplyInternal(x, func, schema)
               })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    +#' back to R as a data.frame.
    +#'
    +#' @param x A GroupedData
    +#' @param func A function to be applied to each group partition specified by GroupedData.
    +#'             The function `func` takes as argument a key - grouping columns and
    +#'             a data frame - a local R data.frame.
    +#'             The output of `func` is a local R data.frame.
    +#' @return a SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @seealso \link{gapply}
    +#' @note gapplyCollect(GroupedData) since 2.0.0
    +setMethod("gapplyCollect",
    +          signature(x = "GroupedData"),
    +          function(x, func) {
    +            gdf <- gapplyInternal(x, func, NULL)
    +            content <- callJMethod(gdf@sdf, "collect")
    +            # content is a list of items of struct type. Each item has a single field
    +            # which is a serialized data.frame corresponds to one group of the
    +            # SparkDataFrame.
    +            ldfs <- lapply(content, function(x) { unserialize(x[[1]]) })
    +            ldf <- do.call(rbind, ldfs)
    +            row.names(ldf) <- NULL
    +            ldf
    +          })
    +
    +gapplyInternal <- function(x, func, schema) {
    +  packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    +                       connection = NULL)
    +  broadcastArr <- lapply(ls(.broadcastNames),
    +                    function(name) { get(name, .broadcastNames) })
    +  sdf <- callJStatic(
    +           "org.apache.spark.sql.api.r.SQLUtils",
    +           "gapply",
    +           x@sgd,
    +           serialize(cleanClosure(func), connection = NULL),
    +           packageNamesArr,
    +           broadcastArr,
    +           if (is.null(schema)) { schema } else { schema$jobj })
    --- End diff --
    
    `if (is.null(schema)) { NULL } else { schema$jobj }` ?
    or
    `if (class(schema) == "jobj") { schema$jobj } else { NULL }`? (this might be more consistent with the pattern used in other places)
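
    The collect path shown in the diff above can be sketched in plain base R, with an illustrative `groups` list standing in for what the JVM-side collect would return (a list of serialized per-group data.frames):

    ```r
    # Base-R sketch of gapplyCollect's collect path: each group's output
    # arrives as a serialized local data.frame; deserialize each one and
    # row-bind them into a single result. `groups` is illustrative stand-in data.
    groups <- list(
      serialize(data.frame(a = 1L, avg = 1.5), connection = NULL),
      serialize(data.frame(a = 3L, avg = 3.0), connection = NULL))
    ldfs <- lapply(groups, unserialize)
    ldf <- do.call(rbind, ldfs)
    row.names(ldf) <- NULL
    # ldf now holds all groups stacked as one local data.frame
    ```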




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60815/
    Test FAILed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] Implement gapplyCollect which will...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    no




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    looks good, thanks!




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60971/consoleFull)** for PR 13760 at commit [`1d62c38`](https://github.com/apache/spark/commit/1d62c38602e393a2e0707caec287aa3353f02b04).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60777 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60777/consoleFull)** for PR 13760 at commit [`de5dbb0`](https://github.com/apache/spark/commit/de5dbb0be0a3fcc42096a10470c543eaf7aa6d5c).




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68571809
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1370,14 +1370,22 @@ setMethod("dapplyCollect",
     #' columns with data types integer and string and the mean which is a double.
     #' schema <-  structType(structField("a", "integer"), structField("c", "string"),
     #'   structField("avg", "double"))
    -#' df1 <- gapply(
    +#' result <- gapply(
    --- End diff --
    
    done




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67799970
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1347,6 +1347,65 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @examples
    +#'
    +#' \dontrun{
    +#' Computes the arithmetic mean of the second column by grouping
    +#' on the first and third columns. Output the grouping values and the average.
    +#'
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list("a", "c"),
    +#'   function(key, x) {
    +#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    +#'     colnames(y) <- c("key_a", "key_c", "mean_b")
    +#'     y
    +#'   })
    +#'
    +#' Result
    +#' ------
    +#' key_a key_c mean_b
    +#' 3 3 3.0
    +#' 1 1 1.5
    +#'
    +#' Fits linear models on iris dataset by grouping on the 'Species' column and
    +#' using 'Sepal_Length' as a target variable, 'Sepal_Width', 'Petal_Length'
    +#' and 'Petal_Width' as training features.
    +#'
    +#' df <- createDataFrame (iris)
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list(df$"Species"),
    +#'   function(key, x) {
    +#'     m <- suppressWarnings(lm(Sepal_Length ~
    +#'     Sepal_Width + Petal_Length + Petal_Width, x))
    --- End diff --
    
    fix the indentation here




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68227494
  
    --- Diff: R/pkg/R/group.R ---
    @@ -199,17 +199,10 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
     #' @export
    +#' @seealso \link{gapplyCollect}
     #' @examples
     #' \dontrun{
    --- End diff --
    
    @sun-rui, @felixcheung, @shivaram, do you think it would be better to move all the examples to DataFrame.R rather than spreading them across both DataFrame.R and group.R?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68090167
  
    --- Diff: R/pkg/R/group.R ---
    @@ -199,17 +199,10 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
     #' @export
    +#' @seealso \link{gapplyCollect}
     #' @examples
     #' \dontrun{
    --- End diff --
    
    Similar to the params being duplicated, right now the examples from the DataFrame and GroupedData versions are duplicated in the generated Rd file. I think this is fine in general, but in the case of `gapply` and `dapply` we can unify the examples and just show that we can call `gapply` with or without a `group_by` ?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60818/
    Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67936513
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1347,6 +1347,65 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @examples
    +#'
    +#' \dontrun{
    +#' Computes the arithmetic mean of the second column by grouping
    +#' on the first and third columns. Output the grouping values and the average.
    +#'
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list("a", "c"),
    +#'   function(key, x) {
    +#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    +#'     colnames(y) <- c("key_a", "key_c", "mean_b")
    +#'     y
    +#'   })
    +#'
    +#' Result
    +#' ------
    +#' key_a key_c mean_b
    +#' 3 3 3.0
    +#' 1 1 1.5
    +#'
    +#' Fits linear models on iris dataset by grouping on the 'Species' column and
    +#' using 'Sepal_Length' as a target variable, 'Sepal_Width', 'Petal_Length'
    +#' and 'Petal_Width' as training features.
    +#'
    +#' df <- createDataFrame (iris)
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list(df$"Species"),
    +#'   function(key, x) {
    +#'     m <- suppressWarnings(lm(Sepal_Length ~
    +#'     Sepal_Width + Petal_Length + Petal_Width, x))
    --- End diff --
    
    Hi @sun-rui, there is an indent (2 spaces) similar to: https://github.com/NarineK/spark/blob/21de08fea1a7b10ee40270eeda9e5a249231f0cf/R/pkg/R/DataFrame.R#L1369
    What do you mean by indent?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68291007
  
    --- Diff: R/pkg/R/group.R ---
    @@ -199,17 +199,10 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
     #' @export
    +#' @seealso \link{gapplyCollect}
     #' @examples
     #' \dontrun{
    --- End diff --
    
    @felixcheung, I took the same examples that gapply has, similar to how dapply and dapplyCollect share the same examples. I think @sun-rui preferred keeping them the same.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68269911
  
    --- Diff: R/pkg/R/group.R ---
    @@ -243,17 +236,73 @@ setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
                 try(if (is.null(schema)) stop("schema cannot be NULL"))
    --- End diff --
    
    why do we have it inside `try` again? don't we want this to fail?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #61064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61064/consoleFull)** for PR 13760 at commit [`022f87d`](https://github.com/apache/spark/commit/022f87d4194d09589e88921ecc4dc9f3e0a7c5d0).




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Thanks @NarineK -- cc @sun-rui for review




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60871/consoleFull)** for PR 13760 at commit [`21de08f`](https://github.com/apache/spark/commit/21de08fea1a7b10ee40270eeda9e5a249231f0cf).




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60778/consoleFull)** for PR 13760 at commit [`4b9cd3e`](https://github.com/apache/spark/commit/4b9cd3eff112bda6eca7018a249d7bc821b9312e).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68508656
  
    --- Diff: R/pkg/R/group.R ---
    @@ -243,17 +236,73 @@ setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
                 try(if (is.null(schema)) stop("schema cannot be NULL"))
    --- End diff --
    
    I think we use `stop` or `warn` directly in SparkR without wrapping in a `try` block. I think the `try` block is useful if we want to catch an error of one kind and then reformat it or show a different error etc. In this case `stop` should be sufficient.
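    The distinction can be sketched in plain R (hypothetical helper names, not the actual SparkR source):

```r
# Sketch with hypothetical helper names (not the actual SparkR source):
# contrasting try(stop(...)) with a bare stop() for argument validation.

validateWithTry <- function(schema) {
  # try() catches the error, so execution continues past the check
  try(if (is.null(schema)) stop("schema cannot be NULL"), silent = TRUE)
  "reached"  # runs even when schema is NULL
}

validateWithStop <- function(schema) {
  # a bare stop() aborts the call, as intended for a required argument
  if (is.null(schema)) stop("schema cannot be NULL")
  "reached"
}

validateWithTry(NULL)   # returns "reached" despite the invalid input
tryCatch(validateWithStop(NULL),
         error = function(e) conditionMessage(e))  # "schema cannot be NULL"
```

    With the bare `stop()`, a `NULL` schema aborts the call immediately instead of being silently swallowed.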




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    LGTM except one minor comment




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60971/consoleFull)** for PR 13760 at commit [`1d62c38`](https://github.com/apache/spark/commit/1d62c38602e393a2e0707caec287aa3353f02b04).




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67778002
  
    --- Diff: R/pkg/R/group.R ---
    @@ -191,18 +184,72 @@ createMethods()
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            gapplyInternal(x, func, schema)
    +          })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    +#' back to R as a data.frame.
    +#'
    +#' @param x a GroupedData
    +#' @param func A function to be applied to each group partition specified by GroupedData.
    +#'             The function `func` takes as argument a key - grouping columns and
    +#'             a data frame - a local R data.frame.
    +#'             The output of `func` is a local R data.frame.
    +#' @return a SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @seealso gapply \link{gapply}
    --- End diff --
    
    please add `@export`




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60779/consoleFull)** for PR 13760 at commit [`f96fae9`](https://github.com/apache/spark/commit/f96fae9405eb684cb74c7ff13a4ecaf61184e6bc).




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60972/consoleFull)** for PR 13760 at commit [`77fb205`](https://github.com/apache/spark/commit/77fb2056d9d3e1a269fc6fba2c8f07806c78a208).




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60871/consoleFull)** for PR 13760 at commit [`21de08f`](https://github.com/apache/spark/commit/21de08fea1a7b10ee40270eeda9e5a249231f0cf).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] Implement gapplyCollect whi...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13760




[GitHub] spark issue #13760: [SPARK-16012][SparkR] Implement gapplyCollect which will...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Thanks all. LGTM. Merging this to master and branch-2.0




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #61300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61300/consoleFull)** for PR 13760 at commit [`1e3f0ac`](https://github.com/apache/spark/commit/1e3f0acbbcfb6110ee9b15d4cda8e00d6c32a5c5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60779/consoleFull)** for PR 13760 at commit [`f96fae9`](https://github.com/apache/spark/commit/f96fae9405eb684cb74c7ff13a4ecaf61184e6bc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68114211
  
    --- Diff: R/pkg/R/group.R ---
    @@ -242,18 +235,73 @@ createMethods()
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    --- End diff --
    
    Yeah, we need it. I tried to do it like dapply, but dapply enforces the schema through its signature while gapply does not.
    Will bring it back, thanks!




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60815/consoleFull)** for PR 13760 at commit [`7dd883f`](https://github.com/apache/spark/commit/7dd883f070d3a620a5b45765ccbe6b5876baa52f).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Build finished. Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    @felixcheung Any other comments on this ?




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68087556
  
    --- Diff: R/pkg/R/group.R ---
    @@ -242,18 +235,73 @@ createMethods()
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    --- End diff --
    
    this check of schema not being null still needs to be preserved for the `gapply` call?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60818/consoleFull)** for PR 13760 at commit [`11c7cd6`](https://github.com/apache/spark/commit/11c7cd6d4bcbff86492e4e996f3317d98bf64901).




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68541122
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1419,6 +1427,80 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    --- End diff --
    
    add `#' @family SparkDataFrame functions`




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60777/
    Test FAILed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68571761
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
    --- End diff --
    
    moved




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #61064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61064/consoleFull)** for PR 13760 at commit [`022f87d`](https://github.com/apache/spark/commit/022f87d4194d09589e88921ecc4dc9f3e0a7c5d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61125/
    Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68565249
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
     #' @note gapply(GroupedData) since 2.0.0
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            if (is.null(schema)) stop("schema cannot be NULL")
    +            gapplyInternal(x, func, schema)
               })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    +#' back to R as a data.frame.
    +#'
    +#' @param x A GroupedData
    +#' @param func A function to be applied to each group partition specified by GroupedData.
    +#'             The function `func` takes as argument a key - grouping columns and
    +#'             a data frame - a local R data.frame.
    +#'             The output of `func` is a local R data.frame.
    +#' @return a SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @seealso \link{gapply}
    +#' @note gapplyCollect(GroupedData) since 2.0.0
    +setMethod("gapplyCollect",
    +          signature(x = "GroupedData"),
    +          function(x, func) {
    +            gdf <- gapplyInternal(x, func, NULL)
    +            content <- callJMethod(gdf@sdf, "collect")
    +            # content is a list of items of struct type. Each item has a single field
    +            # which is a serialized data.frame corresponds to one group of the
    +            # SparkDataFrame.
    +            ldfs <- lapply(content, function(x) { unserialize(x[[1]]) })
    +            ldf <- do.call(rbind, ldfs)
    +            row.names(ldf) <- NULL
    +            ldf
    +          })
    +
    +gapplyInternal <- function(x, func, schema) {
    +  packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    +                       connection = NULL)
    +  broadcastArr <- lapply(ls(.broadcastNames),
    +                    function(name) { get(name, .broadcastNames) })
    +  sdf <- callJStatic(
    +           "org.apache.spark.sql.api.r.SQLUtils",
    +           "gapply",
    +           x@sgd,
    +           serialize(cleanClosure(func), connection = NULL),
    +           packageNamesArr,
    +           broadcastArr,
    +           if (is.null(schema)) { schema } else { schema$jobj })
    --- End diff --
    
    class(schema) is 'structType'.
    I can change it, but again the original version is consistent with dapply.
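
    The collect path in the diff above (unserialize each group's payload and rbind) can be exercised with plain base R, no Spark session required. The data below is illustrative, not SparkR internals; each serialized payload stands in for one collected group:

    ```r
    # Stand-in for what gapplyCollect receives after collect(): one serialized
    # data.frame per group (illustrative data, not SparkR internals).
    groups <- split(data.frame(a = c(1L, 1L, 3L), b = c(1, 2, 3)),
                    c("g1", "g1", "g3"))
    content <- lapply(groups, function(g) serialize(g, connection = NULL))

    # Mirrors the gapplyCollect body: unserialize each group, rbind the pieces,
    # and reset the row names.
    ldfs <- lapply(content, unserialize)
    ldf <- do.call(rbind, ldfs)
    row.names(ldf) <- NULL
    ldf
    #   a b
    # 1 1 1
    # 2 1 2
    # 3 3 3
    ```

    The only difference from the PR's code is that the real payloads are single-field structs, hence the `x[[1]]` indexing there.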




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68269542
  
    --- Diff: R/pkg/R/group.R ---
    @@ -199,17 +199,10 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
     #' @export
    +#' @seealso \link{gapplyCollect}
     #' @examples
     #' \dontrun{
    --- End diff --
    
    examples are important, could we come up with different examples?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60972/consoleFull)** for PR 13760 at commit [`77fb205`](https://github.com/apache/spark/commit/77fb2056d9d3e1a269fc6fba2c8f07806c78a208).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68116142
  
    --- Diff: R/pkg/R/group.R ---
    @@ -199,17 +199,10 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
     #' @export
    +#' @seealso \link{gapplyCollect}
     #' @examples
     #' \dontrun{
    --- End diff --
    
    @shivaram, yes, the example for calculating the average is almost the same: in group.R I use group_by and in DataFrame.R I do not, but we could also combine them all in, say, DataFrame.R and do something like this:
    
    ```
    df <- createDataFrame (
    list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
      c("a", "b", "c", "d"))
    
    # Here our output contains three columns: the key, which is a combination of two
    # columns with data types integer and string, and the mean, which is a double.
    schema <-  structType(structField("a", "integer"), structField("c", "string"),
      structField("avg", "double"))
    df1 <- gapply(
      df,
      c("a", "c"),
      function(key, x) {
        y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
      },
    schema)
    
    # Or we can group the data first and then call gapply on the GroupedData:
    gdf <- group_by(df, "a", "c")
    df1 <- gapply(
      gdf,
      function(key, x) {
        y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
      },
    schema)
    collect(df1)
    ```
    Is this better?
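
    For comparison, a gapplyCollect version of the same example would drop the schema entirely and return an R data.frame directly. The per-group computation can be checked locally with base R (runnable without Spark; the split/rbind here only mimics what the Spark call would produce):

    ```r
    ldf <- data.frame(a = c(1L, 1L, 3L), b = c(1, 2, 3),
                      c = c("1", "1", "3"), d = c(0.1, 0.2, 0.3),
                      stringsAsFactors = FALSE)

    # The same computation func performs per group: key columns plus mean(b).
    result <- do.call(rbind, lapply(split(ldf, list(ldf$a, ldf$c), drop = TRUE),
      function(x) data.frame(a = x$a[1], c = x$c[1], avg = mean(x$b),
                             stringsAsFactors = FALSE)))
    row.names(result) <- NULL
    result
    #   a c avg
    # 1 1 1 1.5
    # 2 3 3 3.0
    ```

    With Spark this would be `gapplyCollect(df, c("a", "c"), function(key, x) { data.frame(key, mean(x$b), stringsAsFactors = FALSE) })`, with no schema argument, matching the API proposed in this PR.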




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68541385
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
    --- End diff --
    
    since most other tags are in DataFrame.R you might want to move this there too.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #61125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61125/consoleFull)** for PR 13760 at commit [`963f14f`](https://github.com/apache/spark/commit/963f14f29264f90af71a24260729e1885846a475).




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68571781
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1419,6 +1427,80 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    --- End diff --
    
    done




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67777776
  
    --- Diff: R/pkg/R/group.R ---
    @@ -150,16 +150,9 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
    +#' @seealso gapplyCollect \link{gapplyCollect}
    --- End diff --
    
    you can leave it as `#' @seealso \link{gapplyCollect}`




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67983701
  
    --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
    @@ -2236,12 +2236,15 @@ test_that("gapply() on a DataFrame", {
       actual <- collect(df1)
       expect_identical(actual, expected)
     
    +  df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
    --- End diff --
    
    maybe better to change `list("a")` to `"a"`, to test whether a scalar column parameter works as well
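
    A minimal sketch of the variant being suggested (hypothetical; assumes the `df` fixture from the surrounding test):

    ```r
    # Both forms should be accepted: a list of column names and a scalar name.
    df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
    df2Collect <- gapplyCollect(df, "a", function(key, x) { x })
    ```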




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #61125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61125/consoleFull)** for PR 13760 at commit [`963f14f`](https://github.com/apache/spark/commit/963f14f29264f90af71a24260729e1885846a475).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67777908
  
    --- Diff: R/pkg/R/group.R ---
    @@ -150,16 +150,9 @@ createMethods()
     #' Applies a R function to each group in the input GroupedData
     #'
     #' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
     #' @rdname gapply
     #' @name gapply
    +#' @seealso gapplyCollect \link{gapplyCollect}
    --- End diff --
    
    please add `@export`




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68542159
  
    --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
    @@ -2236,12 +2236,15 @@ test_that("gapply() on a DataFrame", {
       actual <- collect(df1)
       expect_identical(actual, expected)
     
    +  df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x })
    --- End diff --
    
    we should have both?




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60818/consoleFull)** for PR 13760 at commit [`11c7cd6`](https://github.com/apache/spark/commit/11c7cd6d4bcbff86492e4e996f3317d98bf64901).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60971/
    Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68541007
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1419,6 +1427,80 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    --- End diff --
    
    same here, `@return` (though it does say on L1433)




[GitHub] spark issue #13760: [SPARK-16012][SparkR] gapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60972/
    Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68540924
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1370,14 +1370,22 @@ setMethod("dapplyCollect",
     #' columns with data types integer and string and the mean which is a double.
     #' schema <-  structType(structField("a", "integer"), structField("c", "string"),
     #'   structField("avg", "double"))
    -#' df1 <- gapply(
    +#' result <- gapply(
    --- End diff --
    
    if what is returned is a DataFrame it might help to keep it a variant of "df"
    in fact, you might want to add `@return` to document the return value and type
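
    A sketch of what the suggested roxygen addition might look like (illustrative; the exact wording is an assumption):

    ```r
    #' @return A SparkDataFrame whose schema matches the user-supplied `schema`.
    ```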




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61300/
    Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60778/consoleFull)** for PR 13760 at commit [`4b9cd3e`](https://github.com/apache/spark/commit/4b9cd3eff112bda6eca7018a249d7bc821b9312e).




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #60777 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60777/consoleFull)** for PR 13760 at commit [`de5dbb0`](https://github.com/apache/spark/commit/de5dbb0be0a3fcc42096a10470c543eaf7aa6d5c).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60871/
    Test PASSed.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r67799741
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1347,6 +1347,65 @@ setMethod("gapply",
                 gapply(grouped, func, schema)
               })
     
    +#' gapplyCollect
    +#'
    +#' Groups the SparkDataFrame using the specified columns, applies the R function to each
    +#' group and collects the result back to R as data.frame.
    +#'
    +#' @param x A SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @examples
    +#'
    +#' \dontrun{
    +#' Computes the arithmetic mean of the second column by grouping
    +#' on the first and third columns. Output the grouping values and the average.
    +#'
    +#' result <- gapplyCollect(
    +#'   df,
    +#'   list("a", "c"),
    --- End diff --
    
    would `c("a", "c")` be more natural?
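
    For comparison, the example rewritten with a character vector, as suggested (a sketch; reuses the `df` defined earlier in the example):

    ```r
    result <- gapplyCollect(
      df,
      c("a", "c"),   # character vector instead of list("a", "c")
      function(key, x) {
        # key holds the grouping values; x is the local data.frame for the group.
        y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
        colnames(y) <- c("a", "c", "avg")
        y
      })
    ```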




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68298040
  
    --- Diff: R/pkg/R/group.R ---
    @@ -243,17 +236,73 @@ setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
                 try(if (is.null(schema)) stop("schema cannot be NULL"))
    --- End diff --
    
    With or without try, both work fine.
    
    Without try the error looks like: 
     Error in .local(x, ...) : schema cannot be NULL 
    
    with try: 
    Error in try(if (is.null(schema)) stop("schema cannot be NULL")) : 
      schema cannot be NULL
    
    Is there a convention in SparkR for showing an error message ?
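
    The difference can be reproduced in plain R, without Spark (a minimal illustration):

    ```r
    check_plain <- function(schema) {
      # Plain stop() reports the error with the enclosing call in the message.
      if (is.null(schema)) stop("schema cannot be NULL")
    }
    check_wrapped <- function(schema) {
      # try() catches the condition, so execution continues after printing
      # the longer "Error in try(...)" message quoted above.
      try(if (is.null(schema)) stop("schema cannot be NULL"))
    }
    ```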





[GitHub] spark issue #13760: [SPARK-16012][SparkR] GapplyCollect - applies a R functi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    @felixcheung, I've addressed the comments, and left a reply on the ones that weren't addressed.




[GitHub] spark issue #13760: [SPARK-16012][SparkR] implement gapplyCollect which will...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13760
  
    **[Test build #61300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61300/consoleFull)** for PR 13760 at commit [`1e3f0ac`](https://github.com/apache/spark/commit/1e3f0acbbcfb6110ee9b15d4cda8e00d6c32a5c5).




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68563365
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
     #' @note gapply(GroupedData) since 2.0.0
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            if (is.null(schema)) stop("schema cannot be NULL")
    +            gapplyInternal(x, func, schema)
               })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    +#' back to R as a data.frame.
    +#'
    +#' @param x A GroupedData
    +#' @param func A function to be applied to each group partition specified by GroupedData.
    +#'             The function `func` takes as argument a key - grouping columns and
    +#'             a data frame - a local R data.frame.
    +#'             The output of `func` is a local R data.frame.
    +#' @return a SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @seealso \link{gapply}
    +#' @note gapplyCollect(GroupedData) since 2.0.0
    +setMethod("gapplyCollect",
    +          signature(x = "GroupedData"),
    +          function(x, func) {
    +            gdf <- gapplyInternal(x, func, NULL)
    +            content <- callJMethod(gdf@sdf, "collect")
    --- End diff --
    
    collect(gdf) doesn't really work; collect has to be called on the underlying Java DataFrame reference: `gdf@sdf`.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by NarineK <gi...@git.apache.org>.
Github user NarineK commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68542491
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1370,14 +1370,22 @@ setMethod("dapplyCollect",
     #' columns with data types integer and string and the mean which is a double.
     #' schema <-  structType(structField("a", "integer"), structField("c", "string"),
     #'   structField("avg", "double"))
    -#' df1 <- gapply(
    +#' result <- gapply(
    --- End diff --
    
    thanks, @felixcheung. I kept it consistent with dapply/dapplyCollect, which do not have `@return`. I can add it to gapply.




[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68542673
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies a R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
     #' @note gapply(GroupedData) since 2.0.0
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            if (is.null(schema)) stop("schema cannot be NULL")
    +            gapplyInternal(x, func, schema)
               })
    +
    +#' gapplyCollect
    +#'
    +#' Applies a R function to each group in the input GroupedData and collects the result
    +#' back to R as a data.frame.
    +#'
    +#' @param x A GroupedData
    +#' @param func A function to be applied to each group partition specified by GroupedData.
    +#'             The function `func` takes as argument a key - grouping columns and
    +#'             a data frame - a local R data.frame.
    +#'             The output of `func` is a local R data.frame.
    +#' @return a SparkDataFrame
    +#' @rdname gapplyCollect
    +#' @name gapplyCollect
    +#' @export
    +#' @seealso \link{gapply}
    +#' @note gapplyCollect(GroupedData) since 2.0.0
    +setMethod("gapplyCollect",
    +          signature(x = "GroupedData"),
    +          function(x, func) {
    +            gdf <- gapplyInternal(x, func, NULL)
    +            content <- callJMethod(gdf@sdf, "collect")
    --- End diff --
    
    why not call `content <- collect(gdf)`?
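For context, a sketch of the simplification this comment suggests: reusing SparkR's `collect()` rather than invoking the JVM `collect` via `callJMethod`. The `gapplyInternal` helper and the `GroupedData` signature are taken from the diff above; this is only an illustration of the review suggestion, not the merged implementation.

```r
# Sketch of the reviewer's suggestion for the gapplyCollect body.
# gapplyInternal and the GroupedData signature come from the diff above.
setMethod("gapplyCollect",
          signature(x = "GroupedData"),
          function(x, func) {
            # Run gapply with a NULL schema; each group's output comes
            # back serialized rather than as typed Spark SQL columns.
            gdf <- gapplyInternal(x, func, NULL)
            # Reuse SparkR's collect() path instead of
            # callJMethod(gdf@sdf, "collect").
            collect(gdf)
          })
```

Whether this works as-is depends on whether `collect()` on the schemaless result deserializes each group's binary payload correctly, which is what the question is probing.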


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13760: [SPARK-16012][SparkR] implement gapplyCollect whi...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13760#discussion_r68541684
  
    --- Diff: R/pkg/R/group.R ---
    @@ -198,62 +198,61 @@ createMethods()
     #'
     #' Applies an R function to each group in the input GroupedData
     #'
    -#' @param x a GroupedData
    -#' @param func A function to be applied to each group partition specified by GroupedData.
    -#'             The function `func` takes as argument a key - grouping columns and
    -#'             a data frame - a local R data.frame.
    -#'             The output of `func` is a local R data.frame.
    -#' @param schema The schema of the resulting SparkDataFrame after the function is applied.
    -#'               The schema must match to output of `func`. It has to be defined for each
    -#'               output column with preferred output column name and corresponding data type.
    -#' @return a SparkDataFrame
    +#' @param x A GroupedData
     #' @rdname gapply
     #' @name gapply
     #' @export
    -#' @examples
    -#' \dontrun{
    -#' Computes the arithmetic mean of the second column by grouping
    -#' on the first and third columns. Output the grouping values and the average.
    -#'
    -#' df <- createDataFrame (
    -#' list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
    -#'   c("a", "b", "c", "d"))
    -#'
    -#' Here our output contains three columns, the key which is a combination of two
    -#' columns with data types integer and string and the mean which is a double.
    -#' schema <-  structType(structField("a", "integer"), structField("c", "string"),
    -#'   structField("avg", "double"))
    -#' df1 <- gapply(
    -#'   df,
    -#'   list("a", "c"),
    -#'   function(key, x) {
    -#'     y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    -#'   },
    -#' schema)
    -#' collect(df1)
    -#'
    -#' Result
    -#' ------
    -#' a c avg
    -#' 3 3 3.0
    -#' 1 1 1.5
    -#' }
    +#' @seealso \link{gapplyCollect}
     #' @note gapply(GroupedData) since 2.0.0
     setMethod("gapply",
               signature(x = "GroupedData"),
               function(x, func, schema) {
    -            try(if (is.null(schema)) stop("schema cannot be NULL"))
    -            packageNamesArr <- serialize(.sparkREnv[[".packages"]],
    -                                 connection = NULL)
    -            broadcastArr <- lapply(ls(.broadcastNames),
    -                              function(name) { get(name, .broadcastNames) })
    -            sdf <- callJStatic(
    -                     "org.apache.spark.sql.api.r.SQLUtils",
    -                     "gapply",
    -                     x@sgd,
    -                     serialize(cleanClosure(func), connection = NULL),
    -                     packageNamesArr,
    -                     broadcastArr,
    -                     schema$jobj)
    -            dataFrame(sdf)
    +            if (is.null(schema)) stop("schema cannot be NULL")
    +            gapplyInternal(x, func, schema)
               })
    +
    +#' gapplyCollect
    +#'
    +#' Applies an R function to each group in the input GroupedData and collects the result
    --- End diff --
    
    I think having this would somewhat duplicate gapplyCollect(SparkDataFrame), since they go to the same Rd file. Could you check?
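For reference, a usage sketch of `gapplyCollect` adapted from the `gapply` example removed in this diff (same data and grouping: the mean of column `b` grouped by columns `a` and `c`). It assumes an active SparkR session; unlike `gapply`, no output schema is needed because the result is collected into a local R data.frame.

```r
# Usage sketch adapted from the gapply example removed in this diff.
# Assumes an active SparkR session.
df <- createDataFrame(
  list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
  c("a", "b", "c", "d"))

result <- gapplyCollect(
  df,
  c("a", "c"),
  function(key, x) {
    y <- data.frame(key, mean(x$b), stringsAsFactors = FALSE)
    # Column names come from func's output, not from a schema.
    colnames(y) <- c("a", "c", "avg")
    y
  })
# result: a local data.frame with rows (1, "1", 1.5) and (3, "3", 3.0),
# matching the Result table shown in the removed gapply example.
```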

