Posted to reviews@spark.apache.org by sun-rui <gi...@git.apache.org> on 2015/10/08 07:30:39 UTC

[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

GitHub user sun-rui opened a pull request:

    https://github.com/apache/spark/pull/9023

    [SPARK-10996][SPARKR] Implement sampleBy() in DataFrameStatFunctions.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sun-rui/spark SPARK-10996

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9023.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9023
    
----
commit 4bb5bfd56a351c4decb0aed5c9128c5cc957ee02
Author: Sun Rui <ru...@intel.com>
Date:   2015-10-08T04:36:40Z

    [SPARK-10996][SPARKR] Implement sampleBy() in DataFrameStatFunctions.

commit cefa0ceb36fb58ed5dcbe19f240005f256e615ca
Author: Sun Rui <ru...@intel.com>
Date:   2015-10-08T05:22:14Z

    Minor fix.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147914489
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43698/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146424973
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146428350
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43382/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147046622
  
      [Test build #43522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43522/consoleFull) for   PR 9023 at commit [`09ecea8`](https://github.com/apache/spark/commit/09ecea80cab9b1531bdd6111354d0a5c8e7f8f8f).




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146782780
  
      [Test build #43461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43461/console) for   PR 9023 at commit [`a389372`](https://github.com/apache/spark/commit/a3893720d9d2fb4a0186dfc42c04ff9e223a67c0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147020680
  
    The conversion from a named list to an environment to be passed to the JVM backend was used in several functions, so I extracted it into a common utility function that can be reused. The functions are updated to use this utility function, with no change to their original logic; this is just code reorganization.
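
For readers following the thread, the conversion described above can be sketched roughly as follows. This is a minimal illustration modelled on the loop removed from `sparkR.init` in the quoted diff, not the code merged in the PR; the function name `namedListToEnv` is hypothetical.

```r
# Hypothetical sketch of the named-list-to-environment conversion.
# It copies each named element of the list into a fresh environment.
namedListToEnv <- function(namedList) {
  env <- new.env()
  for (name in names(namedList)) {
    env[[name]] <- namedList[[name]]
  }
  env
}

fractionsEnv <- namedListToEnv(list(first = 0.1, second = 0.2))
keys <- sort(ls(fractionsEnv))      # names stored in the environment
firstFraction <- fractionsEnv[["first"]]
```

Centralizing the loop in one helper lets callers such as `fillna` and `sparkR.init` share the same behavior.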




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147909314
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146428347
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41692723
  
    --- Diff: R/pkg/R/utils.R ---
    @@ -588,3 +588,13 @@ mergePartitions <- function(rdd, zip) {
     
       PipelinedRDD(rdd, partitionFunc)
     }
    +
    +# Convert a named list to an environment to be passed to JVM
    +convertNamedListToEnv <- function(namedList) {
    +  names <- names(namedList)
    --- End diff --
    
    check `length(na.omit(names(namedList))) == length(namedList)`?
    ```
    > a = list(0.1, 0.2)
    > length(na.omit(names(a))) == length(a)
    [1] FALSE
    > a = list(0.1, 0.2)
    > names(a) = c("abc")
    > length(na.omit(names(a))) == length(a)
    [1] FALSE
    > a = list(first=0.1, second=0.2)
    > length(na.omit(names(a))) == length(a)
    [1] TRUE
    ```
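
The check suggested in this comment could be wrapped in a small helper, sketched below under stated assumptions. The name `hasCompleteNames` is hypothetical and this is not the check that was added to the PR. The sketch also rejects empty-string names, because `names(list(a = 1, 2))` yields `""` rather than `NA` for the unnamed element, which `na.omit` alone would not catch.

```r
# Hypothetical helper: TRUE only when every element of the list has a usable name.
hasCompleteNames <- function(namedList) {
  nms <- names(namedList)
  # names() is NULL for a fully unnamed list, NA where a name was never
  # assigned, and "" for elements left unnamed at construction time.
  !is.null(nms) && !anyNA(nms) && all(nms != "")
}

unnamed <- list(0.1, 0.2)
partial <- list(0.1, 0.2)
names(partial) <- c("abc")          # second name becomes NA
mixed <- list(a = 1, 2)             # second name is ""
named <- list(first = 0.1, second = 0.2)
```

A caller could then stop with an error message when `hasCompleteNames()` returns `FALSE`, before anything is passed to the JVM.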




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146649139
  
    I looked into this a bit, since I thought I would be working on SPARK-9443.
    
    Would it be more R-like to support a named list, like this:
    ```
    fractions = list(first=0.1, second=0.2)
    sampleBy(df, "key", fractions, 0)
    ```
    
    because then it is already very easy to access a particular fraction:
    ```
    > fractions[["first"]]
    [1] 0.1
    ```
    
    I think this is more intuitive for an R user than env().




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146776937
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146741164
  
    @felixcheung, yes, I agree. I will change fractions to a named list.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41692768
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1795,17 +1795,15 @@ setMethod("fillna",
                   if (length(colNames) == 0 || !all(colNames != "")) {
                     stop("value should be an a named list with each name being a column name.")
                   }
    -
    -              # Convert to the named list to an environment to be passed to JVM
    -              valueMap <- new.env()
    -              for (col in colNames) {
    -                # Check each item in the named list is of valid type
    -                v <- value[[col]]
    +              # Check each item in the named list is of valid type
    +              lapply(value, function(v) {
    --- End diff --
    
    nit: this goes through the list twice. Perhaps add a FUN param to `convertNamedListToEnv` to allow custom validation of each value?
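
One way the single-pass idea from this comment might look is sketched below. It is only an illustration: the function name `convertWithValidation`, the `FUN(name, value)` calling convention, and the `checkNumeric` validator are all assumptions, not the signature adopted in the PR.

```r
# Hypothetical variant of the conversion helper that validates each value
# while copying it, so the list is traversed only once.
convertWithValidation <- function(namedList, FUN = NULL) {
  env <- new.env()
  for (name in names(namedList)) {
    value <- namedList[[name]]
    if (!is.null(FUN)) {
      FUN(name, value)  # the validator is expected to call stop() on bad input
    }
    env[[name]] <- value
  }
  env
}

# Hypothetical validator: accepts only a single numeric value.
checkNumeric <- function(name, value) {
  if (!is.numeric(value) || length(value) != 1) {
    stop("value for '", name, "' should be a single number")
  }
}

env <- convertWithValidation(list(first = 0.1, second = 0.2), checkNumeric)
outcome <- tryCatch({
  convertWithValidation(list(first = "x"), checkNumeric)
  "accepted"
}, error = function(e) "rejected")
```

Leaving `FUN` as `NULL` keeps the helper usable by callers that need no validation.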




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147046379
  
     Build triggered.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41657723
  
    --- Diff: R/pkg/R/sparkR.R ---
    @@ -163,19 +163,13 @@ sparkR.init <- function(
         sparkHome <- suppressWarnings(normalizePath(sparkHome))
       }
     
    -  sparkEnvirMap <- new.env()
    -  for (varname in names(sparkEnvir)) {
    -    sparkEnvirMap[[varname]] <- sparkEnvir[[varname]]
    -  }
    +  sparkEnvirMap <- convertNamedListToEnv(sparkEnvir)
    --- End diff --
    
    looks like there are other changes in this PR..




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146777022
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147914487
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147069856
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43522/
    Test PASSed.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147914074
  
      [Test build #43698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43698/console) for   PR 9023 at commit [`e97ef0d`](https://github.com/apache/spark/commit/e97ef0da930bae42228ce0590e22a73e012bce96).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147046382
  
    Build started.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147069854
  
    Build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41704114
  
    --- Diff: R/pkg/R/stats.R ---
    @@ -100,3 +100,36 @@ setMethod("corr",
                 statFunctions <- callJMethod(x@sdf, "stat")
                 callJMethod(statFunctions, "corr", col1, col2, method)
               })
    +
    +
    +#' sampleBy
    +#'
    +#' Returns a stratified sample without replacement based on the fraction given on each stratum.
    +#' 
    +#' @param x A SparkSQL DataFrame
    +#' @param col column that defines strata
    +#' @param fractions A named list giving sampling fraction for each stratum. If a stratum is
    +#'                  not specified, we treat its fraction as zero.
    +#' @param seed random seed
    +#' @return A new DataFrame that represents the stratified sample
    +#'
    +#' @rdname statfunctions
    +#' @name sampleBy
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlContext, "/path/to/file.json")
    +#' sample <- sampleBy(df, "key", fractions, 36)
    +#' }
    +setMethod("sampleBy",
    +          signature(x = "DataFrame", col = "character",
    +                    fractions = "list", seed = "numeric"),
    +          function(x, col, fractions, seed) {
    +            fractionsEnv <- convertNamedListToEnv(fractions)
    --- End diff --
    
    the signature prevents the required parameters from being missing.
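
The point about the signature can be illustrated with a toy S4 generic, sketched below. The generic `sampleDemo` and its use of a plain numeric first argument are hypothetical stand-ins, chosen so the example runs without SparkR. When an argument named in the signature is omitted, S4 dispatches on the class `"missing"`, and no method matches.

```r
library(methods)

# Hypothetical generic standing in for sampleBy; not SparkR code.
setGeneric("sampleDemo", function(x, col, fractions, seed) {
  standardGeneric("sampleDemo")
})

# Every argument is typed in the signature, as in the quoted diff.
setMethod("sampleDemo",
          signature(x = "numeric", col = "character",
                    fractions = "list", seed = "numeric"),
          function(x, col, fractions, seed) {
            "dispatched"
          })

complete <- sampleDemo(1, "key", list(a = 0.5), 36)

# Omitting seed makes dispatch look for seed = "missing", which has no method.
omitted <- tryCatch(sampleDemo(1, "key", list(a = 0.5)),
                    error = function(e) "no applicable method")
```

Supporting an omitted `seed` would presumably need an additional method registered with `seed = "missing"`; whether that is desirable is the design question raised in the review.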




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9023




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147909042
  
    @shivaram, rebased to master.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41699681
  
    --- Diff: R/pkg/R/stats.R ---
    @@ -100,3 +100,36 @@ setMethod("corr",
                 statFunctions <- callJMethod(x@sdf, "stat")
                 callJMethod(statFunctions, "corr", col1, col2, method)
               })
    +
    +
    +#' sampleBy
    +#'
    +#' Returns a stratified sample without replacement based on the fraction given on each stratum.
    +#' 
    +#' @param x A SparkSQL DataFrame
    +#' @param col column that defines strata
    +#' @param fractions A named list giving sampling fraction for each stratum. If a stratum is
    +#'                  not specified, we treat its fraction as zero.
    +#' @param seed random seed
    +#' @return A new DataFrame that represents the stratified sample
    +#'
    +#' @rdname statfunctions
    +#' @name sampleBy
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlContext, "/path/to/file.json")
    +#' sample <- sampleBy(df, "key", fractions, 36)
    +#' }
    +setMethod("sampleBy",
    +          signature(x = "DataFrame", col = "character",
    +                    fractions = "list", seed = "numeric"),
    +          function(x, col, fractions, seed) {
    +            fractionsEnv <- convertNamedListToEnv(fractions)
    --- End diff --
    
    Also, should this support the case where fractions or seed are omitted/missing?




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147069823
  
      [Test build #43522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43522/console) for   PR 9023 at commit [`09ecea8`](https://github.com/apache/spark/commit/09ecea80cab9b1531bdd6111354d0a5c8e7f8f8f).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146425647
  
      [Test build #43382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43382/consoleFull) for   PR 9023 at commit [`cefa0ce`](https://github.com/apache/spark/commit/cefa0ceb36fb58ed5dcbe19f240005f256e615ca).




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147909323
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41693821
  
    --- Diff: R/pkg/R/utils.R ---
    @@ -588,3 +588,13 @@ mergePartitions <- function(rdd, zip) {
     
       PipelinedRDD(rdd, partitionFunc)
     }
    +
    +# Convert a named list to an environment to be passed to JVM
    +convertNamedListToEnv <- function(namedList) {
    +  names <- names(namedList)
    --- End diff --
    
    yes, this check makes sense. I will add it.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146778937
  
    [Test build #43461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43461/consoleFull) for PR 9023 at commit [`a389372`](https://github.com/apache/spark/commit/a3893720d9d2fb4a0186dfc42c04ff9e223a67c0).




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41704334
  
    --- Diff: R/pkg/R/stats.R ---
    @@ -100,3 +100,36 @@ setMethod("corr",
                 statFunctions <- callJMethod(x@sdf, "stat")
                 callJMethod(statFunctions, "corr", col1, col2, method)
               })
    +
    +
    +#' sampleBy
    +#'
    +#' Returns a stratified sample without replacement based on the fraction given on each stratum.
    +#' 
    +#' @param x A SparkSQL DataFrame
    +#' @param col column that defines strata
    +#' @param fractions A named list giving sampling fraction for each stratum. If a stratum is
    +#'                  not specified, we treat its fraction as zero.
    +#' @param seed random seed
    +#' @return A new DataFrame that represents the stratified sample
    +#'
    +#' @rdname statfunctions
    +#' @name sampleBy
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' df <- jsonFile(sqlContext, "/path/to/file.json")
    +#' sample <- sampleBy(df, "key", fractions, 36)
    +#' }
    +setMethod("sampleBy",
    +          signature(x = "DataFrame", col = "character",
    +                    fractions = "list", seed = "numeric"),
    +          function(x, col, fractions, seed) {
    +            fractionsEnv <- convertNamedListToEnv(fractions)
    --- End diff --
    
    oh right. :)
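
The semantics described in the roxygen comment above (sampling without replacement per stratum, with unspecified strata treated as fraction zero) can be sketched on a local `data.frame` in base R. This is only an illustration under stated assumptions: `sampleByLocal` is a hypothetical name that is not part of SparkR, and the per-row random draw is one possible way to realize the documented behavior.

```r
# Hypothetical local analogue of sampleBy() for a base R data.frame.
# Each row is kept independently with the probability given for its stratum.
sampleByLocal <- function(df, col, fractions, seed) {
  set.seed(seed)
  keys <- as.character(df[[col]])
  # Look up each row's fraction; strata missing from `fractions` get 0
  frac <- vapply(keys, function(k) {
    f <- fractions[[k]]
    if (is.null(f)) 0 else as.numeric(f)
  }, numeric(1), USE.NAMES = FALSE)
  # One uniform draw per row, so no row can be selected twice
  df[runif(nrow(df)) < frac, , drop = FALSE]
}

df <- data.frame(key = c("a", "a", "b", "c"), v = 1:4,
                 stringsAsFactors = FALSE)
# Keep every row of stratum "a", none of "b"; "c" is unspecified
sampled <- sampleByLocal(df, "key", list(a = 1, b = 0), 36)
```

With fractions of exactly 1 and 0 the result does not depend on the seed; for intermediate fractions the sample size per stratum is random rather than fixed.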




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146608515
  
    cc @felixcheung 




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146783130
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43461/




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146424986
  
    Merged build started.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9023#discussion_r41648341
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -1795,17 +1795,15 @@ setMethod("fillna",
                   if (length(colNames) == 0 || !all(colNames != "")) {
                     stop("value should be an a named list with each name being a column name.")
                   }
    -
    -              # Convert to the named list to an environment to be passed to JVM
    -              valueMap <- new.env()
    -              for (col in colNames) {
    -                # Check each item in the named list is of valid type
    -                v <- value[[col]]
    +              # Check each item in the named list is of valid type
    +              lapply(value, function(v) {
    --- End diff --
    
    Why is the change in `fillna` part of this PR?




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147784338
  
    @sun-rui Change looks pretty good. I didn't notice the refactoring of convertListToEnvironment -- I think that's a good idea. Could you bring this up to date with master?




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147030568
  
    got it. suggest adding a check, looks good otherwise.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146428102
  
    [Test build #43382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43382/console) for PR 9023 at commit [`cefa0ce`](https://github.com/apache/spark/commit/cefa0ceb36fb58ed5dcbe19f240005f256e615ca).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147909701
  
    [Test build #43698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43698/consoleFull) for PR 9023 at commit [`e97ef0d`](https://github.com/apache/spark/commit/e97ef0da930bae42228ce0590e22a73e012bce96).




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-147939288
  
    Thanks @sun-rui LGTM. Merging this.




[GitHub] spark pull request: [SPARK-10996][SPARKR] Implement sampleBy() in ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9023#issuecomment-146783126
  
    Merged build finished. Test PASSed.

