You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2016/02/09 00:53:14 UTC

[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/11124

    [SPARK-13011] K-means wrapper in SparkR

    https://issues.apache.org/jira/browse/SPARK-13011

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark SPARK-13011

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11124
    
----
commit 13961129e2e6e519114e9ddca61420c0aff91b57
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-01-31T05:26:02Z

    init version of kmeans

commit 071d1b76ffbd02b6b1a9707217646af20beee1c5
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-01-31T05:26:09Z

    Merge branch 'master' into SPARK-13011

commit 509bfedf1b152b87621baff5245e2e74ae379166
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-02-04T19:10:24Z

    refine code

commit 376f0bec2230e6723595c68de70f18040a0d990e
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-02-05T00:08:53Z

    refine spark side wrapper

commit 4e38f7e104eb8895dfdc067a10933d864b07e708
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-02-08T23:21:06Z

    refine interface

commit fcf749d4e7f21c055fcc35fedf3261bdd11e7af7
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-02-08T23:49:30Z

    add test

commit ca0ea4a2e709f9c381be8fe164216e76e2823603
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-02-08T23:51:39Z

    revert private for SparkRWrappers

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187601570
  
    test it please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183783291
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51249/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183783032
  
    **[Test build #51249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51249/consoleFull)** for PR 11124 at commit [`44c7595`](https://github.com/apache/spark/commit/44c759567916bb651103d36535bd77f358ab0186).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-181649156
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50945/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52681513
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
    @@ -51,6 +52,22 @@ private[r] object SparkRWrappers {
         pipeline.fit(df)
       }
     
    +  def fitKMeans(
    +      df: DataFrame,
    +      initMode: String,
    +      maxIter: Double,
    +      k: Double,
    +      columns: Array[String]): PipelineModel = {
    +    val assembler = new VectorAssembler().setInputCols(columns).setOutputCol("temp-features")
    --- End diff --
    
    Do not name the output column. There exists a chance that the output column name already exists. The default one should work well. In L66, use `setFeaturesCol(assembler.getOutputCol)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187624547
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183092811
  
    @mengxr I am on my way implementing `summary`. The summary of `stats:kmeans` contains "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault". However I think there is no need to implement all of them in this PR. I choose "cluster", "centers" and "size" in this version. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183794873
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-181649032
  
    **[Test build #50945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50945/consoleFull)** for PR 11124 at commit [`ca0ea4a`](https://github.com/apache/spark/commit/ca0ea4a2e709f9c381be8fe164216e76e2823603).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187603205
  
    **[Test build #51751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51751/consoleFull)** for PR 11124 at commit [`2276323`](https://github.com/apache/spark/commit/2276323a5334c9785507b0d31f4349a82bbd86cd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183805907
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183791574
  
    @mengxr I add two summary variables `cluster` and `size` in `KMeansModel` to resemble counterparts in R's kmeans. IMHO even though `summary(kmeans.model)` looks incompleted, the `kmeans.model` itself contains those variables. I add these two first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52637690
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    +          function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
    +            columnNames <- as.array(colnames(x))
    +            algorithm <- if(missing(algorithm)) "random" else match.arg(algorithm)
    +            model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
    +                                 algorithm, iter.max, centers, columnNames)
    +            return(new("PipelineModel", model = model))
    --- End diff --
    
    Sorry I missed this. Fix it in next commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183779253
  
    **[Test build #51249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51249/consoleFull)** for PR 11124 at commit [`44c7595`](https://github.com/apache/spark/commit/44c759567916bb651103d36535bd77f358ab0186).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187624551
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51751/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52836307
  
    --- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
    @@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
       baseSummary <- summary(baseModel)
       expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
     })
    +
    +test_that("kmeans", {
    +  newIris <- iris
    +  newIris$Species <- NULL
    +  training <- suppressWarnings(createDataFrame(sqlContext, newIris))
    +
    +  # Cahce the DataFrame here to work around the bug SPARK-13178.
    --- End diff --
    
    typo: cache


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187589974
  
    @yinxusen Could you resolve conflicts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-181634614
  
    **[Test build #50945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50945/consoleFull)** for PR 11124 at commit [`ca0ea4a`](https://github.com/apache/spark/commit/ca0ea4a2e709f9c381be8fe164216e76e2823603).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183791953
  
    test it please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187599874
  
    @mengxr, do it now.
    
    On Mon, Feb 22, 2016 at 11:57 PM, Xiangrui Meng <no...@github.com>
    wrote:
    
    > @yinxusen <https://github.com/yinxusen> Could you resolve conflicts?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11124#issuecomment-187589974>.
    >
    
    
    
    -- 
    Cheers
    -----------------------------------
    Xusen Yin    (尹绪森)
    LinkedIn: https://cn.linkedin.com/in/xusenyin



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52521719
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    +          function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
    +            columnNames <- as.array(colnames(x))
    +            algorithm <- if(missing(algorithm)) "random" else match.arg(algorithm)
    +            model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
    +                                 algorithm, iter.max, centers, columnNames)
    +            return(new("PipelineModel", model = model))
    --- End diff --
    
    does the resultant model work with `summary(model)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52520956
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    --- End diff --
    
    shall this match the signature of stats::kmeans in R?
    https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187969935
  
    LGTM. Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52681503
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    --- End diff --
    
    The current signature looks okay to me. Just need to make sure we don't shadow R's own `kmeans`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183805908
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51253/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183794878
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51252/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183783288
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183092021
  
    @yinxusen I made one pass. Could you update the PR and implement both `summary` and `fitted`? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-187624262
  
    **[Test build #51751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51751/consoleFull)** for PR 11124 at commit [`2276323`](https://github.com/apache/spark/commit/2276323a5334c9785507b0d31f4349a82bbd86cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-181649155
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52521378
  
    --- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
    @@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
       baseSummary <- summary(baseModel)
       expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
     })
    +
    +test_that("kmeans", {
    +  newIris <- iris
    +  newIris$Species <- NULL
    +  training <- suppressWarnings(createDataFrame(sqlContext, newIris))
    +
    +  # Cahce the DataFrame here to work around the bug SPARK-13178.
    +  cache(training)
    +  take(training, 1)
    +
    +  model <- kmeans(x = training, centers = 2)
    +  sample <- take(select(predict(model, training), "prediction"), 1)
    +  expect_equal(typeof(sample$prediction), "integer")
    +  expect_equal(sample$prediction, 1)
    +})
    --- End diff --
    
    could you add another test with `algorithm` parameter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11124


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183805607
  
    **[Test build #51253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51253/consoleFull)** for PR 11124 at commit [`ecb2850`](https://github.com/apache/spark/commit/ecb285003ea94cbdcdda53eb65350c4d5ac53379).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52690728
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    --- End diff --
    
    I'll add test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183792181
  
    **[Test build #51253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51253/consoleFull)** for PR 11124 at commit [`ecb2850`](https://github.com/apache/spark/commit/ecb285003ea94cbdcdda53eb65350c4d5ac53379).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11124#issuecomment-183131188
  
    I checked R's output. It seems that `summary` on a k-means model doesn't provide much information:
    
    ~~~>
    summary(model)
                 Length Class  Mode   
    cluster      3      -none- numeric
    centers      2      -none- numeric
    totss        1      -none- numeric
    withinss     2      -none- numeric
    tot.withinss 1      -none- numeric
    betweenss    1      -none- numeric
    size         2      -none- numeric
    iter         1      -none- numeric
    ifault       1      -none- numeric
    ~~~
    
    It seems like `summary` is not implemented for k-means model. So I would just implement `model$centers` and `fitted(model, method=c("classes"))`. Note that we don't provide `cluster` (cluster assignments) and `size` (cluster sizes) in Scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52635130
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    --- End diff --
    
    I don't think we can make the signature the same as stats:kmeans for now, since `KMeans` in MLlib doesn't provide the same interface with R's. E.g. `KMeans` in MLlib doesn't support warm-start so the `centers` has only to be a "numeric", and it doesn't have arguments like `nstart` and `trace`. Refer to [the implementation of `glm` in SparkR](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L53), which also cannot match the signature with R's 100%.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52635341
  
    --- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
    @@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
       baseSummary <- summary(baseModel)
       expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
     })
    +
    +test_that("kmeans", {
    +  newIris <- iris
    +  newIris$Species <- NULL
    +  training <- suppressWarnings(createDataFrame(sqlContext, newIris))
    +
    +  # Cahce the DataFrame here to work around the bug SPARK-13178.
    +  cache(training)
    +  take(training, 1)
    +
    +  model <- kmeans(x = training, centers = 2)
    +  sample <- take(select(predict(model, training), "prediction"), 1)
    +  expect_equal(typeof(sample$prediction), "integer")
    +  expect_equal(sample$prediction, 1)
    +})
    --- End diff --
    
    Sure, I'll add it soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52546712
  
    --- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
    @@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
       baseSummary <- summary(baseModel)
       expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
     })
    +
    +test_that("kmeans", {
    +  newIris <- iris
    +  newIris$Species <- NULL
    +  training <- suppressWarnings(createDataFrame(sqlContext, newIris))
    +
    +  # Cahce the DataFrame here to work around the bug SPARK-13178.
    +  cache(training)
    +  take(training, 1)
    +
    +  model <- kmeans(x = training, centers = 2)
    +  sample <- take(select(predict(model, training), "prediction"), 1)
    +  expect_equal(typeof(sample$prediction), "integer")
    +  expect_equal(sample$prediction, 1)
    +})
    --- End diff --
    
    also please add a test for `stats::kmeans` like [this](https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1274)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11124#discussion_r52681510
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
                   return(list(coefficients = coefficients))
                 }
               })
    +
    +#' Fit a k-means model
    +#'
    +#' Fit a k-means model, similarly to R's kmeans().
    +#'
    +#' @param x DataFrame for training
    +#' @param centers Number of centers
    +#' @param iter.max Maximum iteration number
    +#' @param algorithm Algorithm choosen to fit the model
    +#' @return A fitted k-means model
    +#' @rdname kmeans
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' model <- kmeans(x, centers = 2, algorithm="random")
    +#'}
    +setMethod("kmeans", signature(x = "DataFrame"),
    +          function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
    +            columnNames <- as.array(colnames(x))
    +            algorithm <- if(missing(algorithm)) "random" else match.arg(algorithm)
    +            model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
    +                                 algorithm, iter.max, centers, columnNames)
    +            return(new("PipelineModel", model = model))
    --- End diff --
    
    It is equally important to implement `fitted`, which makes predictions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org