You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2016/02/09 00:53:14 UTC
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
GitHub user yinxusen opened a pull request:
https://github.com/apache/spark/pull/11124
[SPARK-13011] K-means wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-13011
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yinxusen/spark SPARK-13011
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11124.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11124
----
commit 13961129e2e6e519114e9ddca61420c0aff91b57
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-01-31T05:26:02Z
init version of kmeans
commit 071d1b76ffbd02b6b1a9707217646af20beee1c5
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-01-31T05:26:09Z
Merge branch 'master' into SPARK-13011
commit 509bfedf1b152b87621baff5245e2e74ae379166
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-02-04T19:10:24Z
refine code
commit 376f0bec2230e6723595c68de70f18040a0d990e
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-02-05T00:08:53Z
refine spark side wrapper
commit 4e38f7e104eb8895dfdc067a10933d864b07e708
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-02-08T23:21:06Z
refine interface
commit fcf749d4e7f21c055fcc35fedf3261bdd11e7af7
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-02-08T23:49:30Z
add test
commit ca0ea4a2e709f9c381be8fe164216e76e2823603
Author: Xusen Yin <yi...@gmail.com>
Date: 2016-02-08T23:51:39Z
revert private for SparkRWrappers
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187601570
test it please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183783291
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51249/
Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183783032
**[Test build #51249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51249/consoleFull)** for PR 11124 at commit [`44c7595`](https://github.com/apache/spark/commit/44c759567916bb651103d36535bd77f358ab0186).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-181649156
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50945/
Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52681513
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala ---
@@ -51,6 +52,22 @@ private[r] object SparkRWrappers {
pipeline.fit(df)
}
+ def fitKMeans(
+ df: DataFrame,
+ initMode: String,
+ maxIter: Double,
+ k: Double,
+ columns: Array[String]): PipelineModel = {
+ val assembler = new VectorAssembler().setInputCols(columns).setOutputCol("temp-features")
--- End diff --
Do not name the output column. There exists a chance that the output column name already exists. The default one should work well. In L66, use `setFeaturesCol(assembler.getOutputCol)`.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187624547
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183092811
@mengxr I am on my way implementing `summary`. The summary of `stats:kmeans` contains "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault". However I think there is no need to implement all of them in this PR. I choose "cluster", "centers" and "size" in this version. What do you think?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183794873
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-181649032
**[Test build #50945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50945/consoleFull)** for PR 11124 at commit [`ca0ea4a`](https://github.com/apache/spark/commit/ca0ea4a2e709f9c381be8fe164216e76e2823603).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187603205
**[Test build #51751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51751/consoleFull)** for PR 11124 at commit [`2276323`](https://github.com/apache/spark/commit/2276323a5334c9785507b0d31f4349a82bbd86cd).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183805907
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183791574
@mengxr I add two summary variables `cluster` and `size` in `KMeansModel` to resemble counterparts in R's kmeans. IMHO even though `summary(kmeans.model)` looks incompleted, the `kmeans.model` itself contains those variables. I add these two first.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52637690
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
+ function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
+ columnNames <- as.array(colnames(x))
+ algorithm <- if(missing(algorithm)) "random" else match.arg(algorithm)
+ model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
+ algorithm, iter.max, centers, columnNames)
+ return(new("PipelineModel", model = model))
--- End diff --
Sorry I missed this. Fix it in next commit.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183779253
**[Test build #51249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51249/consoleFull)** for PR 11124 at commit [`44c7595`](https://github.com/apache/spark/commit/44c759567916bb651103d36535bd77f358ab0186).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187624551
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51751/
Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by sun-rui <gi...@git.apache.org>.
Github user sun-rui commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52836307
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
baseSummary <- summary(baseModel)
expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
})
+
+test_that("kmeans", {
+ newIris <- iris
+ newIris$Species <- NULL
+ training <- suppressWarnings(createDataFrame(sqlContext, newIris))
+
+ # Cahce the DataFrame here to work around the bug SPARK-13178.
--- End diff --
typo: cache
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187589974
@yinxusen Could you resolve conflicts?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-181634614
**[Test build #50945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50945/consoleFull)** for PR 11124 at commit [`ca0ea4a`](https://github.com/apache/spark/commit/ca0ea4a2e709f9c381be8fe164216e76e2823603).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183791953
test it please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187599874
@mengxr, do it now.
On Mon, Feb 22, 2016 at 11:57 PM, Xiangrui Meng <no...@github.com>
wrote:
> @yinxusen <https://github.com/yinxusen> Could you resolve conflicts?
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/11124#issuecomment-187589974>.
>
--
Cheers
-----------------------------------
Xusen Yin (尹绪森)
LinkedIn: https://cn.linkedin.com/in/xusenyin
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52521719
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
+ function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
+ columnNames <- as.array(colnames(x))
+ algorithm <- if(missing(algorithm)) "random" else match.arg(algorithm)
+ model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
+ algorithm, iter.max, centers, columnNames)
+ return(new("PipelineModel", model = model))
--- End diff --
does the resultant model work with `summary(model)`?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52520956
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
--- End diff --
shall this match the signature of stats::kmeans in R?
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187969935
LGTM. Merged into master. Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52681503
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
--- End diff --
The current signature looks okay to me. Just need to make sure we don't shadow R's own `kmeans`.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183805908
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51253/
Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183794878
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51252/
Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183783288
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183092021
@yinxusen I made one pass. Could you update the PR and implement both `summary` and `fitted`? Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-187624262
**[Test build #51751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51751/consoleFull)** for PR 11124 at commit [`2276323`](https://github.com/apache/spark/commit/2276323a5334c9785507b0d31f4349a82bbd86cd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-181649155
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52521378
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
baseSummary <- summary(baseModel)
expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
})
+
+test_that("kmeans", {
+ newIris <- iris
+ newIris$Species <- NULL
+ training <- suppressWarnings(createDataFrame(sqlContext, newIris))
+
+ # Cahce the DataFrame here to work around the bug SPARK-13178.
+ cache(training)
+ take(training, 1)
+
+ model <- kmeans(x = training, centers = 2)
+ sample <- take(select(predict(model, training), "prediction"), 1)
+ expect_equal(typeof(sample$prediction), "integer")
+ expect_equal(sample$prediction, 1)
+})
--- End diff --
could you add another test with `algorithm` parameter?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/11124
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183805607
**[Test build #51253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51253/consoleFull)** for PR 11124 at commit [`ecb2850`](https://github.com/apache/spark/commit/ecb285003ea94cbdcdda53eb65350c4d5ac53379).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52690728
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
--- End diff --
I'll add test
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183792181
**[Test build #51253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51253/consoleFull)** for PR 11124 at commit [`ecb2850`](https://github.com/apache/spark/commit/ecb285003ea94cbdcdda53eb65350c4d5ac53379).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/11124#issuecomment-183131188
I checked R's output. It seems that `summary` on a k-means model doesn't provide much information:
~~~>
summary(model)
Length Class Mode
cluster 3 -none- numeric
centers 2 -none- numeric
totss 1 -none- numeric
withinss 2 -none- numeric
tot.withinss 1 -none- numeric
betweenss 1 -none- numeric
size 2 -none- numeric
iter 1 -none- numeric
ifault 1 -none- numeric
~~~
It seems like `summary` is not implemented for k-means model. So I would just implement `model$centers` and `fitted(model, method=c("classes"))`. Note that we don't provide `cluster` (cluster assignments) and `size` (cluster sizes) in Scala.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52635130
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
--- End diff --
I don't think we can make the signature the same as stats:kmeans for now, since `KMeans` in MLlib doesn't provide the same interface with R's. E.g. `KMeans` in MLlib doesn't support warm-start so the `centers` has only to be a "numeric", and it doesn't have arguments like `nstart` and `trace`. Refer to [the implementation of `glm` in SparkR](https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L53), which also cannot match the signature with R's 100%.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52635341
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
baseSummary <- summary(baseModel)
expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
})
+
+test_that("kmeans", {
+ newIris <- iris
+ newIris$Species <- NULL
+ training <- suppressWarnings(createDataFrame(sqlContext, newIris))
+
+ # Cahce the DataFrame here to work around the bug SPARK-13178.
+ cache(training)
+ take(training, 1)
+
+ model <- kmeans(x = training, centers = 2)
+ sample <- take(select(predict(model, training), "prediction"), 1)
+ expect_equal(typeof(sample$prediction), "integer")
+ expect_equal(sample$prediction, 1)
+})
--- End diff --
Sure, I'll add it soon.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52546712
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -113,3 +113,18 @@ test_that("summary works on base GLM models", {
baseSummary <- summary(baseModel)
expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
})
+
+test_that("kmeans", {
+ newIris <- iris
+ newIris$Species <- NULL
+ training <- suppressWarnings(createDataFrame(sqlContext, newIris))
+
+ # Cahce the DataFrame here to work around the bug SPARK-13178.
+ cache(training)
+ take(training, 1)
+
+ model <- kmeans(x = training, centers = 2)
+ sample <- take(select(predict(model, training), "prediction"), 1)
+ expect_equal(typeof(sample$prediction), "integer")
+ expect_equal(sample$prediction, 1)
+})
--- End diff --
also please add a test for `stats::kmeans` like [this](https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1274)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13011] K-means wrapper in SparkR
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/11124#discussion_r52681510
--- Diff: R/pkg/R/mllib.R ---
@@ -126,3 +126,27 @@ setMethod("summary", signature(object = "PipelineModel"),
return(list(coefficients = coefficients))
}
})
+
+#' Fit a k-means model
+#'
+#' Fit a k-means model, similarly to R's kmeans().
+#'
+#' @param x DataFrame for training
+#' @param centers Number of centers
+#' @param iter.max Maximum iteration number
+#' @param algorithm Algorithm choosen to fit the model
+#' @return A fitted k-means model
+#' @rdname kmeans
+#' @export
+#' @examples
+#'\dontrun{
+#' model <- kmeans(x, centers = 2, algorithm="random")
+#'}
+setMethod("kmeans", signature(x = "DataFrame"),
+ function(x, centers, iter.max = 10, algorithm = c("random", "k-means||")) {
+ columnNames <- as.array(colnames(x))
+ algorithm <- if(missing(algorithm)) "random" else match.arg(algorithm)
+ model <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers", "fitKMeans", x@sdf,
+ algorithm, iter.max, centers, columnNames)
+ return(new("PipelineModel", model = model))
--- End diff --
It is equally important to implement `fitted`, which makes predictions.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org