You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2017/03/01 09:31:24 UTC

[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/17117

    [SPARK-10780][ML] Support initial model for KMeans.

    ## What changes were proposed in this pull request?
    Support initial model for ```KMeans```.
    
    ## How was this patch tested?
    Add unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-10780

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17117.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17117
    
----
commit ddd8d86b8a8fca89705a438bc6336e589fa27c8b
Author: Yanbo Liang <yb...@gmail.com>
Date:   2017-03-01T09:29:12Z

    Support initial model for KMeans.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104820054
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
         model
       }
     
    +  /**
    +   * Check validity for interactions between parameters.
    +   */
    +  private def assertInitialModelValid(): Unit = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    --- End diff --
    
    Also, this is not needed if we do the overwriting work in `setInitialModel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104165327
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -22,22 +22,28 @@ import scala.util.Random
     import org.apache.spark.SparkFunSuite
     import org.apache.spark.ml.linalg.{Vector, Vectors}
     import org.apache.spark.ml.param.ParamMap
    -import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
    -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
    +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils}
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel}
    +import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
     import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
     
     private[clustering] case class TestRow(features: Vector)
     
     class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
     
    +  import testImplicits._
    +
       final val k = 5
       @transient var dataset: Dataset[_] = _
    +  @transient var rData: Dataset[_] = _
     
       override def beforeAll(): Unit = {
         super.beforeAll()
     
         dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k)
    +    rData = GaussianMixtureSuite.rData.map(GaussianMixtureSuite.FeatureData).toDF()
    --- End diff --
    
    Updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104170315
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
    @@ -111,12 +113,20 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite =>
         val estimator2 = testDefaultReadWrite(estimator)
         testParams.foreach { case (p, v) =>
           val param = estimator.getParam(p)
    -      assert(estimator.get(param).get === estimator2.get(param).get)
    +      if (param.name == "initialModel") {
    +        // Estimator's `initialModel` has same type as the model produced by this estimator.
    --- End diff --
    
    Let's merged #17151 firstly, then I will update this accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104084997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -253,7 +255,18 @@ object KMeansModel extends MLReadable[KMeansModel] {
     @Since("1.5.0")
     class KMeans @Since("1.5.0") (
         @Since("1.5.0") override val uid: String)
    -  extends Estimator[KMeansModel] with KMeansParams with DefaultParamsWritable {
    +  extends Estimator[KMeansModel]
    +    with KMeansParams with HasInitialModel[KMeansModel] with MLWritable {
    +
    +  /**
    +   * A KMeansModel to use for warm start.
    +   * Note the cluster count of initial model must be equal with [[k]],
    +   * otherwise, throws IllegalArgumentException.
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val initialModel: Param[KMeansModel] =
    --- End diff --
    
    I prefer doing this in the same way that ALS does it. By having separate param traits `KMeansParams extends KMeansModelParams with HasInitialModel`. It's more explicit since now our `KMeans` class would have extra params on top of `KMeansParams`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104164797
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -253,7 +255,18 @@ object KMeansModel extends MLReadable[KMeansModel] {
     @Since("1.5.0")
     class KMeans @Since("1.5.0") (
         @Since("1.5.0") override val uid: String)
    -  extends Estimator[KMeansModel] with KMeansParams with DefaultParamsWritable {
    +  extends Estimator[KMeansModel]
    +    with KMeansParams with HasInitialModel[KMeansModel] with MLWritable {
    +
    +  /**
    +   * A KMeansModel to use for warm start.
    +   * Note the cluster count of initial model must be equal with [[k]],
    +   * otherwise, throws IllegalArgumentException.
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val initialModel: Param[KMeansModel] =
    --- End diff --
    
    Make sense, I refactored them as ```ALSParams``` and ```ALSModelParams```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104095197
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -182,6 +224,7 @@ object KMeansSuite {
         "predictionCol" -> "myPrediction",
         "k" -> 3,
         "maxIter" -> 2,
    -    "tol" -> 0.01
    +    "tol" -> 0.01,
    +    "initialModel" -> generateRandomKMeansModel(3, 3)
    --- End diff --
    
    It would be nicer to change `testEstimatorAndModelReadWrite` to accept `estimatorTestParams` and `modelTestParams` separately so we don't have to hard code certain params to be filtered out inside that method. Though we wouldn't have to that in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104924893
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         val kmeans = new KMeans()
         testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
       }
    +
    +  test("training with initial model") {
    +    val kmeans = new KMeans().setK(2).setSeed(1)
    +    val model1 = kmeans.fit(rData)
    +    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
    +    model2.clusterCenters.zip(model1.clusterCenters)
    +      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
    +  }
    +
    +  test("training with initial model, error cases") {
    +    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
    +
    +    // Sets initMode with 'initialModel', but does not specify initial model.
    +    intercept[IllegalArgumentException] {
    --- End diff --
    
    I think we can not override param in any ```set***``` function, see reason at [here](https://github.com/apache/spark/pull/17117#discussion_r104922364). This is why I didn't follow the idea of previous PR. If we prefer the previous way, we must to handle override in ```fit``` function, I'll update following this idea tomorrow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104819439
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
         model
       }
     
    +  /**
    +   * Check validity for interactions between parameters.
    +   */
    +  private def assertInitialModelValid(): Unit = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    +        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
    +      }
    +    }
    +  }
    +
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    assertInitialModelValid()
    --- End diff --
    
    Why this is not checked in `fit`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104092773
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         val kmeans = new KMeans()
         testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
       }
    +
    +  test("training with initial model") {
    +    val kmeans = new KMeans().setK(2).setSeed(1)
    +    val model1 = kmeans.fit(rData)
    +    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
    +    model2.clusterCenters.zip(model1.clusterCenters)
    +      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
    +  }
    +
    +  test("training with initial model, error cases") {
    +    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
    +
    +    // Sets initMode with 'initialModel', but does not specify initial model.
    +    intercept[IllegalArgumentException] {
    --- End diff --
    
    I'm not sure I agree with the behavior. We discussed it quite a bit in the other PR - maybe you can summarize the reason you went away from the previous decisions? At any rate, it seems currently we have the following behavior:
    
    | k      | initMode           | initialModel  | result |
    --- | --- | --- | ---
    | ?    | not set | set | ignore InitialModel |
    | ?    | set | not set | error |
    |  set (k != initialModelK)   | set | set | error |
    |  set (k == initialModelK)   | set | set | use initialModel |
    
    If we keep this behavior, we should add a test for the first case. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r105025280
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
         model
       }
     
    +  /**
    +   * Check validity for interactions between parameters.
    +   */
    +  private def assertInitialModelValid(): Unit = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    +        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
    +      }
    +    }
    +  }
    +
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    assertInitialModelValid()
    --- End diff --
    
    `transformSchema ` is called in `transform` method, and `model.transform` is called in computing the summary. I think we should fail it earlier instead of checking it in the end. Also, it's implicit that it's being checked when computing summary. We should explicitly check it. 
    
    If we have small logic in checking, I'll have those checking code in `fit` method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73852/testReport)** for PR 17117 at commit [`7d842e0`](https://github.com/apache/spark/commit/7d842e0c38c3e3d1b0567378ae5a98ecc1597166).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73852/testReport)** for PR 17117 at commit [`7d842e0`](https://github.com/apache/spark/commit/7d842e0c38c3e3d1b0567378ae5a98ecc1597166).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104815453
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
       @Since("1.5.0")
       def setSeed(value: Long): this.type = set(seed, value)
     
    +  /** @group setParam */
    +  @Since("2.2.0")
    +  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)
    --- End diff --
    
    How about
    ```scala
    def setInitialModel(value: KMeansModel): this.type = {
      if (getK ~= value.getK) {
        log the warning
        set(k, value)
      }
      set(initMode, MLlibKMeans.K_MEANS_INITIAL_MODEL) // We may log, but I don't really care for this one.
      set(initialModel, value)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104092158
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -22,22 +22,28 @@ import scala.util.Random
     import org.apache.spark.SparkFunSuite
     import org.apache.spark.ml.linalg.{Vector, Vectors}
     import org.apache.spark.ml.param.ParamMap
    -import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
    -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
    +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils}
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel}
    +import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
     import org.apache.spark.mllib.util.MLlibTestSparkContext
     import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
     
     private[clustering] case class TestRow(features: Vector)
     
     class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
     
    +  import testImplicits._
    +
       final val k = 5
       @transient var dataset: Dataset[_] = _
    +  @transient var rData: Dataset[_] = _
     
       override def beforeAll(): Unit = {
         super.beforeAll()
     
         dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k)
    +    rData = GaussianMixtureSuite.rData.map(GaussianMixtureSuite.FeatureData).toDF()
    --- End diff --
    
    `GaussianMixtureSuite.rData.map(Tuple1.apply).toDF()` ? Mapping the dummy case class from another test suite is less clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73686/testReport)** for PR 17117 at commit [`2824d85`](https://github.com/apache/spark/commit/2824d856ea27ecdb1c2fedf0e394dcd692f86fad).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104830001
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
     import org.apache.spark.util.VersionUtils.majorVersion
     
     /**
    - * Common params for KMeans and KMeansModel
    + * Common params for KMeans and KMeansModel.
      */
    -private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol
    +private[clustering] trait KMeansModelParams extends Params with HasMaxIter with HasFeaturesCol
       with HasSeed with HasPredictionCol with HasTol {
    --- End diff --
    
    Yeah, we decided in the previous discussion to not store the initial model in the produced model, for several reasons, including model serialization.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73686/testReport)** for PR 17117 at commit [`2824d85`](https://github.com/apache/spark/commit/2824d856ea27ecdb1c2fedf0e394dcd692f86fad).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r103675288
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
     
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    +        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
    +      }
    +    }
         validateAndTransformSchema(schema)
       }
    +
    +  @Since("2.2.0")
    +  override def write: MLWriter = new KMeans.KMeansWriter(this)
     }
     
     @Since("1.6.0")
    -object KMeans extends DefaultParamsReadable[KMeans] {
    +object KMeans extends MLReadable[KMeans] {
     
       @Since("1.6.0")
       override def load(path: String): KMeans = super.load(path)
    +
    +  @Since("2.2.0")
    +  override def read: MLReader[KMeans] = new KMeansReader
    +
    +  /** [[MLWriter]] instance for [[KMeans]] */
    +  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      DefaultParamsWriter.saveInitialModel(instance, path)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc)
    +    }
    --- End diff --
    
    I was trying to move ```saveInitialModel``` into ```saveMetadata``` and making it more succinct. We can do this for ```MLWriter```, but it's hard for ```MLReader[T]```. Since we need to explicitly pass the type of ```initialModel``` as well, so we need refactor ```MLReader[T]``` to ```MLReader[T, M]```. However, I think lots of estimators/transformers will not use ```initialModel```, so the extra type ```[M]``` does not make sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73850/testReport)** for PR 17117 at commit [`4226149`](https://github.com/apache/spark/commit/422614918010b3c1c298341f7650c5c4d2364d9f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104829816
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         val kmeans = new KMeans()
         testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
       }
    +
    +  test("training with initial model") {
    +    val kmeans = new KMeans().setK(2).setSeed(1)
    +    val model1 = kmeans.fit(rData)
    +    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
    +    model2.clusterCenters.zip(model1.clusterCenters)
    +      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
    +  }
    +
    +  test("training with initial model, error cases") {
    +    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
    +
    +    // Sets initMode with 'initialModel', but does not specify initial model.
    +    intercept[IllegalArgumentException] {
    --- End diff --
    
    Yeah, I think the general idea laid out in the previous PR is preferable. As you and DB say, if you'd like to make the second `.setInitMode` overwrite the initial model then that is fine. With that change, then the behavior I would prefer is:
    
    ````scala
      test("params") {
        val initialK = 3
        val initialEstimator = new KMeans()
          .setK(initialK)
        val initialModel = initialEstimator.fit(dataset)
    
        val km = new KMeans()
          .setK(initialK + 1)
          .setInitMode(MLlibKMeans.RANDOM)
    
        assert(km.getK === initialK + 1)
        assert(km.getInitMode === MLlibKMeans.RANDOM)
    
        km.setInitialModel(initialModel)
    
        // initialModel sets k and init mode
        assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
        assert(km.getK === initialK)
        assert(km.getInitialModel.getK === initialK)
    
        // setting k is ignored
        km.setK(initialK + 1)
        assert(km.getK === initialK)
    
        // this should work since we already set initialModel
        km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
    
        // changing initMode clears the initial model
        km.setInitMode(MLlibKMeans.RANDOM)
        assert(km.getInitMode === MLlibKMeans.RANDOM)
        assert(!km.isSet(km.initialModel))
        // k is retained from initial model
        assert(km.getK === initialK)
        // now k can be set
        km.setK(initialK + 1)
        assert(km.getK === initialK + 1)
    
        // kmeans should throw an error since we shouldn't be allowed to set init mode to "initialModel"
        intercept[IllegalArgumentException] {
          km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
        }
      }
    ````


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104091867
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -418,6 +418,8 @@ object KMeans {
       val RANDOM = "random"
       @Since("0.8.0")
       val K_MEANS_PARALLEL = "k-means||"
    +  @Since("2.2.0")
    +  val K_MEANS_INITIAL_MODEL = "initialModel"
    --- End diff --
    
    It can be private I think. That, or we should update the valid options for the `setInitializationMode` doc. But I think it's best to make it private.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73850/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104168600
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         val kmeans = new KMeans()
         testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
       }
    +
    +  test("training with initial model") {
    +    val kmeans = new KMeans().setK(2).setSeed(1)
    +    val model1 = kmeans.fit(rData)
    +    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
    +    model2.clusterCenters.zip(model1.clusterCenters)
    +      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
    +  }
    +
    +  test("training with initial model, error cases") {
    +    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
    +
    +    // Sets initMode with 'initialModel', but does not specify initial model.
    +    intercept[IllegalArgumentException] {
    --- End diff --
    
    I disagree the way in the other PR, the reason is:
    In that PR, if users ```setInitialModel(model)```, it will call ```set(initMode, "initialModel")```.  Take the following scenarios:
    ```
    val kmeans = new KMeans().setInitialModel(initialModel) // Users want to start with an initial model.
    val model1 = kmeans.fit(dataset) // The model was fitted by warm start.
    // Then they want to try another starting way, for example, starting with "k-means||".
    val model2 = kmeans.setInitMode("k-means||") // But in #11119 's code route, it will still starts with initial model. Though we can change this by modify the code in mllib.clustering.KMeans, but I think it's confused. 
    ```
    
    Another scenario is users set ```initialModel``` by mistake, but they still want to start with ```random``` mode, they will confused what happened. 
    So I'm more prefer to let users set ```initMode``` to ```initialModel``` explicitly, and set ```initialModel``` to corresponding model. Otherwise, we just throw exceptions to let users correct their setting. I'm OK to add a test for the first case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104922364
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
       @Since("1.5.0")
       def setSeed(value: Long): this.type = set(seed, value)
     
    +  /** @group setParam */
    +  @Since("2.2.0")
    +  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)
    --- End diff --
    
    I think we can not override param in any ```set***``` function, since ML pipeline API supports other param setting method like:
    ```
    def fit(dataset: Dataset[_], paramMap: ParamMap): M = {
        copy(paramMap).fit(dataset)
      }
    ```
    I think the only way to override param is at the start of ```fit``` function.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104803180
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
     import org.apache.spark.util.VersionUtils.majorVersion
     
     /**
    - * Common params for KMeans and KMeansModel
    + * Common params for KMeans and KMeansModel.
      */
    -private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol
    +private[clustering] trait KMeansModelParams extends Params with HasMaxIter with HasFeaturesCol
       with HasSeed with HasPredictionCol with HasTol {
    --- End diff --
    
    Now, `KMeansModel` mixes `KMeansModelParams`, does it mean in the model level, we can not get the information of the `initiModel`? Also, in the model, why do we need to mix the seed in?  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104830238
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
         model
       }
     
    +  /**
    +   * Check validity for interactions between parameters.
    +   */
    +  private def assertInitialModelValid(): Unit = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    +        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
    +      }
    +    }
    +  }
    +
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    assertInitialModelValid()
    --- End diff --
    
    `transformSchema` will be called in the train method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104090529
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
     
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    +        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
    +      }
    +    }
         validateAndTransformSchema(schema)
       }
    +
    +  @Since("2.2.0")
    +  override def write: MLWriter = new KMeans.KMeansWriter(this)
     }
     
     @Since("1.6.0")
    -object KMeans extends DefaultParamsReadable[KMeans] {
    +object KMeans extends MLReadable[KMeans] {
     
       @Since("1.6.0")
       override def load(path: String): KMeans = super.load(path)
    +
    +  @Since("2.2.0")
    +  override def read: MLReader[KMeans] = new KMeansReader
    +
    +  /** [[MLWriter]] instance for [[KMeans]] */
    +  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      DefaultParamsWriter.saveInitialModel(instance, path)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc)
    +    }
    +  }
    +
    +  private class KMeansReader extends MLReader[KMeans] {
    +
    +    override def load(path: String): KMeans = {
    +      val metadata = DefaultParamsReader.loadMetadata(path, sc, classOf[KMeans].getName)
    +      val instance = new KMeans(metadata.uid)
    +
    +      DefaultParamsReader.getAndSetParams(instance, metadata)
    +      DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
    --- End diff --
    
    This can be done as:
    
    ````scala
     DefaultParamsReader.loadInitialModel[KMeansModel](path, sc).foreach(instance.setInitialModel)
    ````
    
    I think it's nicer, but I'm not sure if there is a universal preference for side effects with options in Spark, so I'll leave it to you to decide.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104094526
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
    @@ -111,12 +113,20 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite =>
         val estimator2 = testDefaultReadWrite(estimator)
         testParams.foreach { case (p, v) =>
           val param = estimator.getParam(p)
    -      assert(estimator.get(param).get === estimator2.get(param).get)
    +      if (param.name == "initialModel") {
    +        // Estimator's `initialModel` has same type as the model produced by this estimator.
    --- End diff --
    
    This is an assumption, and is not enforced by the compiler. There is nothing in the trait `HasInitialModel[T <: Model[T]]`that prevents us from creating an estimator with an initialModel type that is not the same type of the model that the estimator produces. We can discuss whether or not we'd like to enforce this assumption, but if we do not then this method should probably be changed. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r105002771
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
     import org.apache.spark.util.VersionUtils.majorVersion
     
     /**
    - * Common params for KMeans and KMeansModel
    + * Common params for KMeans and KMeansModel.
      */
    -private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol
    +private[clustering] trait KMeansModelParams extends Params with HasMaxIter with HasFeaturesCol
       with HasSeed with HasPredictionCol with HasTol {
    --- End diff --
    
    Fair enough.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104819810
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
         model
       }
     
    +  /**
    +   * Check validity for interactions between parameters.
    +   */
    +  private def assertInitialModelValid(): Unit = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    --- End diff --
    
    I don't think this check is needed if we overwrite `k` when `initialModel` is set.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73852/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    cc @dbtsai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73850/testReport)** for PR 17117 at commit [`4226149`](https://github.com/apache/spark/commit/422614918010b3c1c298341f7650c5c4d2364d9f).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104800325
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         val kmeans = new KMeans()
         testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
       }
    +
    +  test("training with initial model") {
    +    val kmeans = new KMeans().setK(2).setSeed(1)
    +    val model1 = kmeans.fit(rData)
    +    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
    +    model2.clusterCenters.zip(model1.clusterCenters)
    +      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
    +  }
    +
    +  test("training with initial model, error cases") {
    +    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
    +
    +    // Sets initMode with 'initialModel', but does not specify initial model.
    +    intercept[IllegalArgumentException] {
    --- End diff --
    
    My 2cents is the latter configuration should be able to overwrite the former settings and related settings with warning messages. 
    
    In your example, when `kmeans.setInitMode("k-means||")` is performed, the first `setInitialModel` should be ignored with warning message.
    
     Even we do `setK(k =3)`, and later we do `.setInitialModel(initialModel)`, we should ignore the first `setK(k =3)` with warning. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73682/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73682/testReport)** for PR 17117 at commit [`ddd8d86`](https://github.com/apache/spark/commit/ddd8d86b8a8fca89705a438bc6336e589fa27c8b).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    **[Test build #73682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73682/testReport)** for PR 17117 at commit [`ddd8d86`](https://github.com/apache/spark/commit/ddd8d86b8a8fca89705a438bc6336e589fa27c8b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104170135
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -182,6 +224,7 @@ object KMeansSuite {
         "predictionCol" -> "myPrediction",
         "k" -> 3,
         "maxIter" -> 2,
    -    "tol" -> 0.01
    +    "tol" -> 0.01,
    +    "initialModel" -> generateRandomKMeansModel(3, 3)
    --- End diff --
    
    Agree, I sent #17151 and feel free to comment it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang closed the pull request at:

    https://github.com/apache/spark/pull/17117


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73686/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r105025830
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
       @Since("1.5.0")
       def setSeed(value: Long): this.type = set(seed, value)
     
    +  /** @group setParam */
    +  @Since("2.2.0")
    +  def setInitialModel(value: KMeansModel): this.type = set(initialModel, value)
    --- End diff --
    
    Can you elaborate this? I don't fully understand why we can not overwrite setting in set method. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104165225
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
     
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    +      if (isSet(initialModel)) {
    +        val initialModelK = $(initialModel).parentModel.k
    +        if (initialModelK != $(k)) {
    +          throw new IllegalArgumentException("The initial model's cluster count = " +
    +            s"$initialModelK, mismatched with k = $k.")
    +        }
    +      } else {
    +        throw new IllegalArgumentException("Users must set param initialModel if you choose " +
    +          "'initialModel' as the initialization algorithm.")
    +      }
    +    } else {
    +      if (isSet(initialModel)) {
    +        logWarning(s"Param initialModel will take no effect when initMode is $initMode.")
    +      }
    +    }
         validateAndTransformSchema(schema)
       }
    +
    +  @Since("2.2.0")
    +  override def write: MLWriter = new KMeans.KMeansWriter(this)
     }
     
     @Since("1.6.0")
    -object KMeans extends DefaultParamsReadable[KMeans] {
    +object KMeans extends MLReadable[KMeans] {
     
       @Since("1.6.0")
       override def load(path: String): KMeans = super.load(path)
    +
    +  @Since("2.2.0")
    +  override def read: MLReader[KMeans] = new KMeansReader
    +
    +  /** [[MLWriter]] instance for [[KMeans]] */
    +  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      DefaultParamsWriter.saveInitialModel(instance, path)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc)
    +    }
    +  }
    +
    +  private class KMeansReader extends MLReader[KMeans] {
    +
    +    override def load(path: String): KMeans = {
    +      val metadata = DefaultParamsReader.loadMetadata(path, sc, classOf[KMeans].getName)
    +      val instance = new KMeans(metadata.uid)
    +
    +      DefaultParamsReader.getAndSetParams(instance, metadata)
    +      DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
    --- End diff --
    
    Yeah, your suggestion can work well, but I'm more prefer to my way, since it's more clear for developer to understand what happened.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104090273
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
     
       @Since("1.5.0")
       override def transformSchema(schema: StructType): StructType = {
    +    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
    --- End diff --
    
    It might be nice to factor this logic out into a method like `assertInitialModelValid` or something similar. Actually, we could add an abstract method to the `HasInitialModel` trait that each subclass can implement differently. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r105002617
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
    @@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR
         val kmeans = new KMeans()
         testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData)
       }
    +
    +  test("training with initial model") {
    +    val kmeans = new KMeans().setK(2).setSeed(1)
    +    val model1 = kmeans.fit(rData)
    +    val model2 = kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
    +    model2.clusterCenters.zip(model1.clusterCenters)
    +      .foreach { case (center2, center1) => assert(center2 ~== center1 absTol 1E-8) }
    +  }
    +
    +  test("training with initial model, error cases") {
    +    val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
    +
    +    // Sets initMode with 'initialModel', but does not specify initial model.
    +    intercept[IllegalArgumentException] {
    --- End diff --
    
    @sethah +1 on the behavior you purpose. The only thing I would like to add on is `setK` should throw `IllegalArgumentException`.
    
    ```scala
        // initialModel sets k and init mode
        assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
        assert(km.getK === initialK)
        assert(km.getInitialModel.getK === initialK)
    
        // setting k will throw exception.
        intercept[IllegalArgumentException] {
          km.setK(initialK + 1)
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104821332
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
         model
       }
     
    +  /**
    +   * Check validity for interactions between parameters.
    +   */
    +  private def assertInitialModelValid(): Unit = {
    --- End diff --
    
    I think with overwriting above, the only thing we need to check will be 
    
    ```scala
    if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL && !isSet(initialModel)) {
      throw new IllegalArgumentException("Users must set param initialModel if you choose " +
               "'initialModel' as the initialization.")
    }
    ```
    
    we can just have it in the body of `fit` method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17117#discussion_r104084877
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
    @@ -123,7 +126,8 @@ class KMeansModel private[ml] (
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema, logging = true)
    -    val predictUDF = udf((vector: Vector) => predict(vector))
    +    val tmpParent: MLlibKMeansModel = parentModel
    --- End diff --
    
    Can we change it to `localParent`? That's the convention we have taken elsewhere when we want to get a separate pointer to a class member.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17117: [SPARK-10780][ML] Support initial model for KMeans.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17117
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org