Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/29 00:21:41 UTC

[GitHub] [spark] huaxingao opened a new pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

huaxingao opened a new pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035
 
 
   ### What changes were proposed in this pull request?
   Add instance weight support to BisectingKMeans.
   
   
   ### Why are the changes needed?
   BisectingKMeans should support instance weighting
   
   
   ### Does this PR introduce any user-facing change?
   Yes. This adds `BisectingKMeans.setWeightCol`.
   
   
   ### How was this patch tested?
   Unit test
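
To make "instance weighting" concrete, here is a minimal sketch in plain Scala of a weighted centroid, the quantity a weighted clustering step typically computes: sum(w_i * x_i) / sum(w_i). The object `WeightedCentroidSketch` and its method are hypothetical names for illustration only; they are not part of Spark, and this PR's actual implementation may differ.

```scala
// Hypothetical helper, not part of Spark: weighted centroid of (point, weight) pairs.
object WeightedCentroidSketch {
  /** Returns sum(w_i * x_i) / sum(w_i) over all instances. */
  def centroid(instances: Seq[(Array[Double], Double)]): Array[Double] = {
    require(instances.nonEmpty, "need at least one instance")
    val dim = instances.head._1.length
    val sums = new Array[Double](dim)
    var weightSum = 0.0
    for ((point, weight) <- instances) {
      require(point.length == dim, "all points must share one dimension")
      require(weight >= 0.0, "weights must be non-negative")
      var i = 0
      while (i < dim) {
        sums(i) += weight * point(i) // accumulate the weighted coordinate
        i += 1
      }
      weightSum += weight
    }
    require(weightSum > 0.0, "total weight must be positive")
    sums.map(_ / weightSum)
  }
}
```

With every weight equal to 1.0 this reduces to the ordinary mean, which matches the documented default of treating unset weights as 1.0.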
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao edited a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao edited a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569742346
 
 
   I saw the conflicting file. I will fix this after https://github.com/apache/spark/pull/27052 is merged, so I don't have to rebase twice.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569468197
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571004428
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20920/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569471198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115898/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569468135
 
 
   **[Test build #115898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115898/testReport)** for PR 27035 at commit [`e8bd33f`](https://github.com/apache/spark/commit/e8bd33f7332a1fa9380b302096960473a8b6e019).



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571756479
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r363690149
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
 ##########
 @@ -261,31 +264,50 @@ class BisectingKMeans @Since("2.0.0") (
   @Since("2.4.0")
   def setDistanceMeasure(value: String): this.type = set(distanceMeasure, value)
 
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("3.0.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): BisectingKMeansModel = instrumented { instr =>
     transformSchema(dataset.schema, logging = true)
 
     val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-    val rdd = DatasetUtils.columnToOldVector(dataset, getFeaturesCol)
+    val w = if (isDefined(weightCol) && $(weightCol).nonEmpty) {
+      col($(weightCol)).cast(DoubleType)
+    } else {
+      lit(1.0)
+    }
+
+    val instances: RDD[(OldVector, Double)] = dataset
+      .select(DatasetUtils.columnToVector(dataset, getFeaturesCol), w).rdd.map {
+      case Row(point: Vector, weight: Double) => (OldVectors.fromML(point), weight)
+    }
     if (handlePersistence) {
 
 Review comment:
   Maybe we no longer need to (or should not) handle persistence on the .ml side for KMeans and BisectingKMeans? @srowen
   
   Regardless of whether the input dataset is cached, we always cache a training RDD (`Vector`s in .ml, or `VectorWithNorm`s in .mllib), so we should cache the one with norms for better performance.
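
One reason keeping norms alongside the vectors can pay off: squared Euclidean distance can be rewritten as ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so a stored norm is reused in every distance evaluation instead of being recomputed. The sketch below is plain Scala under that assumption; `PointWithNorm` and `CachedNormSketch` are hypothetical names and are not Spark's `VectorWithNorm`.

```scala
// Hypothetical illustration in plain Scala; not Spark's internal VectorWithNorm class.
object CachedNormSketch {
  /** A point stored together with its precomputed L2 norm. */
  final case class PointWithNorm(values: Array[Double], norm: Double)

  /** Computes the L2 norm once, at construction time. */
  def withNorm(values: Array[Double]): PointWithNorm =
    PointWithNorm(values, math.sqrt(values.map(v => v * v).sum))

  private def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }

  /** ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, reusing the stored norms. */
  def squaredDistance(a: PointWithNorm, b: PointWithNorm): Double =
    // Clamp at zero to guard against tiny negative values from rounding.
    math.max(0.0, a.norm * a.norm + b.norm * b.norm - 2.0 * dot(a.values, b.values))
}
```

If the points are cached in this form, each iteration pays only for the dot product, which is the motivation for caching the norm-carrying representation rather than the raw vectors.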
   



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571004421
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569471197
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571728180
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569468135
 
 
   **[Test build #115898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115898/testReport)** for PR 27035 at commit [`e8bd33f`](https://github.com/apache/spark/commit/e8bd33f7332a1fa9380b302096960473a8b6e019).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569727656
 
 
   Build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571756479
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571728191
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21049/
   Test PASSed.



[GitHub] [spark] srowen closed pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035
 
 
   



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569462433
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20685/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569727270
 
 
   **[Test build #115962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115962/testReport)** for PR 27035 at commit [`987ba2e`](https://github.com/apache/spark/commit/987ba2e941e57d9ce1bf66cba6bac625c603ea8d).



[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r361830133
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,34 @@ class BisectingKMeans private (
   private[spark] def run(
       input: RDD[Vector],
       instr: Option[Instrumentation]): BisectingKMeansModel = {
+    val instances: RDD[(Vector, Double)] = input.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
+  }
+
+  private[spark] def runWithWeight(
+      input: RDD[(Vector, Double)],
+      instr: Option[Instrumentation]): BisectingKMeansModel = {
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
         + " its parent RDDs are also not cached.")
     }
-    val d = input.map(_.size).first()
+    val d = input.map( i => i._1.size).first()
     logInfo(s"Feature dimension: $d.")
 
+    val dataVectorWithNorm = input.map(d => d._1)
 
 Review comment:
   It seems that `dataVectorWithNorm` is only used to compute `norms`, so how about removing it and writing `val norms = input.map(d => Vectors.norm(d._1, 2.0))...` directly?
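
The suggestion above can be sketched in plain Scala as follows. This is only an illustration: a `Seq` stands in for the `RDD`, and the hypothetical helper `l2Norm` stands in for `Vectors.norm(v, 2.0)`; `NormsSketch` is not a Spark object.

```scala
// Hypothetical stand-in for the RDD-based code: norms taken straight from (vector, weight) pairs.
object NormsSketch {
  /** L2 norm of a dense vector; plays the role of Vectors.norm(v, 2.0). */
  def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  /** Maps each (vector, weight) pair directly to the vector's norm, with no intermediate collection of bare vectors. */
  def norms(input: Seq[(Array[Double], Double)]): Seq[Double] =
    input.map(d => l2Norm(d._1))
}
```

The weight in each pair is ignored here on purpose, since the norm depends only on the vector.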



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571004428
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20920/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569768110
 
 
   **[Test build #115962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115962/testReport)** for PR 27035 at commit [`987ba2e`](https://github.com/apache/spark/commit/987ba2e941e57d9ce1bf66cba6bac625c603ea8d).
    * This patch passes all tests.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571019094
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569464382
 
 
   **[Test build #115895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115895/testReport)** for PR 27035 at commit [`5e7ccdb`](https://github.com/apache/spark/commit/5e7ccdb3459fc5de76473a1eb61413cad9bee6a8).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569768694
 
 
   Build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569768694
 
 
   Build finished. Test PASSed.



[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r363688214
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,33 @@ class BisectingKMeans private (
   private[spark] def run(
 
 Review comment:
   Do we still need this method, `def run(input: RDD[Vector], instr: Option[Instrumentation])`?



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569768700
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115962/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569462430
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569464394
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115895/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569727656
 
 
   Build finished. Test PASSed.



[GitHub] [spark] huaxingao commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-573778208
 
 
   Thanks! @srowen @zhengruifeng 



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571019094
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571730771
 
 
   **[Test build #116255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116255/testReport)** for PR 27035 at commit [`6ca078e`](https://github.com/apache/spark/commit/6ca078ec59b1818500d5b86dd2ca7a80eb670808).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571004421
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r364536142
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
 ##########
 @@ -261,31 +264,50 @@ class BisectingKMeans @Since("2.0.0") (
   @Since("2.4.0")
   def setDistanceMeasure(value: String): this.type = set(distanceMeasure, value)
 
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("3.0.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): BisectingKMeansModel = instrumented { instr =>
     transformSchema(dataset.schema, logging = true)
 
     val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-    val rdd = DatasetUtils.columnToOldVector(dataset, getFeaturesCol)
+    val w = if (isDefined(weightCol) && $(weightCol).nonEmpty) {
+      col($(weightCol)).cast(DoubleType)
+    } else {
+      lit(1.0)
+    }
+
+    val instances: RDD[(OldVector, Double)] = dataset
+      .select(DatasetUtils.columnToVector(dataset, getFeaturesCol), w).rdd.map {
+      case Row(point: Vector, weight: Double) => (OldVectors.fromML(point), weight)
+    }
     if (handlePersistence) {
 
 Review comment:
   I guess we should advise end users NOT to cache input DFs/RDDs, and the implementations should always cache the internal DFs/RDDs they need.
   Sometimes caching a DF outside of the implementation does not help performance much. For example, in decision trees the vectors are converted to tree points, and the cached vectors are not used any more.
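
   For the `weightCol` fallback in the hunk quoted above (every instance gets weight 1.0 when no weight column is set), here is a minimal sketch of how instance weights typically enter a centroid computation. It uses plain Scala collections rather than Spark types, and `WeightedCentroidSketch`, `weightedCentroid` and `withUnitWeights` are hypothetical names for illustration only; this is an assumption about the general technique, not a copy of the code in this PR.

```scala
// Hypothetical illustration, not Spark code: a weighted mean over (point, weight) pairs.
object WeightedCentroidSketch {
  def weightedCentroid(instances: Seq[(Array[Double], Double)]): Array[Double] = {
    require(instances.nonEmpty, "need at least one instance")
    val dim = instances.head._1.length
    val sums = new Array[Double](dim)
    var weightSum = 0.0
    for ((point, weight) <- instances) {
      var j = 0
      while (j < dim) {
        sums(j) += weight * point(j) // each coordinate is scaled by the instance weight
        j += 1
      }
      weightSum += weight
    }
    sums.map(_ / weightSum)
  }

  // Mirrors the fallback in the diff: without a weight column every row has weight 1.0.
  def withUnitWeights(points: Seq[Array[Double]]): Seq[(Array[Double], Double)] =
    points.map(p => (p, 1.0))
}
```

   With unit weights this reduces to the ordinary mean, which is why treating a missing weight column as 1.0 should leave unweighted results unchanged.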



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569468198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20688/
   Test PASSed.



[GitHub] [spark] huaxingao commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569478281
 
 
   @zhengruifeng @srowen 



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571005548
 
 
   **[Test build #116128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116128/testReport)** for PR 27035 at commit [`ae7948b`](https://github.com/apache/spark/commit/ae7948b2d5feea8fd61e733302d0c04b11c05774).



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569468197
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569462430
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569471198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115898/
   Test PASSed.



[GitHub] [spark] zhengruifeng commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-570978517
 
 
   ping @huaxingao  #27052 is merged



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569727660
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20755/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571728191
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21049/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569462433
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20685/
   Test PASSed.



[GitHub] [spark] huaxingao commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569742346
 
 
   I saw the above conflict file. I will fix this after https://github.com/apache/spark/pull/27052 is merged, so I don't have to rebase twice. 



[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r361838152
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,34 @@ class BisectingKMeans private (
   private[spark] def run(
       input: RDD[Vector],
       instr: Option[Instrumentation]): BisectingKMeansModel = {
+    val instances: RDD[(Vector, Double)] = input.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
+  }
+
+  private[spark] def runWithWeight(
+      input: RDD[(Vector, Double)],
+      instr: Option[Instrumentation]): BisectingKMeansModel = {
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
         + " its parent RDDs are also not cached.")
     }
-    val d = input.map(_.size).first()
+    val d = input.map( i => i._1.size).first()
 
 Review comment:
   Or just keep existing line?



[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r361830795
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 ##########
 @@ -521,8 +521,10 @@ object KMeans {
 /**
  * A vector with its norm for fast distance computation.
  */
-private[clustering] class VectorWithNorm(val vector: Vector, val norm: Double)
-    extends Serializable {
+private[clustering] class VectorWithNorm(
 
 Review comment:
   I am neutral on adding the weight to `VectorWithNorm`, but if we do, what about also using it in KMeans?
   For example: `val zippedData: RDD[(VectorWithNorm, Double)]` => `val zippedData: RDD[VectorWithNorm]`
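
   A rough sketch of what carrying the weight inside the vector wrapper could look like. `WeightedVectorWithNorm`, `fromValues` and `totalWeight` are hypothetical names, and `Array[Double]` stands in for the mllib `Vector`; the real `VectorWithNorm` is `private[clustering]` and its final shape in this PR may differ.

```scala
// Illustrative stand-in for a VectorWithNorm that also carries an instance weight.
final case class WeightedVectorWithNorm(
    vector: Array[Double],
    norm: Double,
    weight: Double = 1.0)

object WeightedVectorWithNorm {
  // Computes the L2 norm from the values; the weight defaults to 1.0 (unweighted).
  def fromValues(values: Array[Double], weight: Double = 1.0): WeightedVectorWithNorm =
    WeightedVectorWithNorm(values, math.sqrt(values.map(x => x * x).sum), weight)

  // With the weight stored on the element, callers work with a collection of
  // WeightedVectorWithNorm instead of (VectorWithNorm, Double) pairs.
  def totalWeight(data: Seq[WeightedVectorWithNorm]): Double =
    data.map(_.weight).sum
}
```

   The trade-off is one extra field on every wrapped vector in exchange for not threading a separate `Double` through each pair.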



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569464392
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571756136
 
 
   **[Test build #116255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116255/testReport)** for PR 27035 at commit [`6ca078e`](https://github.com/apache/spark/commit/6ca078ec59b1818500d5b86dd2ca7a80eb670808).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571018831
 
 
   **[Test build #116128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116128/testReport)** for PR 27035 at commit [`ae7948b`](https://github.com/apache/spark/commit/ae7948b2d5feea8fd61e733302d0c04b11c05774).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569727270
 
 
   **[Test build #115962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115962/testReport)** for PR 27035 at commit [`987ba2e`](https://github.com/apache/spark/commit/987ba2e941e57d9ce1bf66cba6bac625c603ea8d).



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571730771
 
 
   **[Test build #116255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116255/testReport)** for PR 27035 at commit [`6ca078e`](https://github.com/apache/spark/commit/6ca078ec59b1818500d5b86dd2ca7a80eb670808).



[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569462733
 
 
   **[Test build #115895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115895/testReport)** for PR 27035 at commit [`5e7ccdb`](https://github.com/apache/spark/commit/5e7ccdb3459fc5de76473a1eb61413cad9bee6a8).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569471197
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r362032729
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,34 @@ class BisectingKMeans private (
   private[spark] def run(
       input: RDD[Vector],
       instr: Option[Instrumentation]): BisectingKMeansModel = {
+    val instances: RDD[(Vector, Double)] = input.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
+  }
+
+  private[spark] def runWithWeight(
+      input: RDD[(Vector, Double)],
+      instr: Option[Instrumentation]): BisectingKMeansModel = {
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
         + " its parent RDDs are also not cached.")
     }
-    val d = input.map(_.size).first()
+    val d = input.map( i => i._1.size).first()
 
 Review comment:
   Do you mean ```val d = input.map(_._1.size).first()```?
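
The two spellings discussed here are equivalent. A minimal sketch in plain Scala, assuming a `Seq` of `(Array[Double], Double)` pairs as a stand-in for `RDD[(Vector, Double)]` (the object and value names are hypothetical and not part of the patch):

```scala
object PlaceholderSyntaxDemo {
  // Stand-in for RDD[(Vector, Double)]: (features, weight) pairs.
  val instances: Seq[(Array[Double], Double)] = Seq(
    (Array(1.0, 2.0, 3.0), 1.0),
    (Array(4.0, 5.0, 6.0), 0.5)
  )

  // Explicit lambda, as written in the diff above.
  val dExplicit: Int = instances.map(i => i._1.size).head

  // Placeholder syntax, as suggested in the review comment.
  val dPlaceholder: Int = instances.map(_._1.size).head

  def main(args: Array[String]): Unit = {
    assert(dExplicit == dPlaceholder)
    println(s"Feature dimension: $dExplicit")
  }
}
```

Both forms read the size of the first tuple element; the placeholder form only drops the named parameter.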


[GitHub] [spark] SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569471151
 
 
   **[Test build #115898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115898/testReport)** for PR 27035 at commit [`e8bd33f`](https://github.com/apache/spark/commit/e8bd33f7332a1fa9380b302096960473a8b6e019).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569462733
 
 
   **[Test build #115895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115895/testReport)** for PR 27035 at commit [`5e7ccdb`](https://github.com/apache/spark/commit/5e7ccdb3459fc5de76473a1eb61413cad9bee6a8).


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571756490
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116255/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569768700
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115962/
   Test PASSed.


[GitHub] [spark] srowen commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r363771852
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
 ##########
 @@ -261,31 +264,50 @@ class BisectingKMeans @Since("2.0.0") (
   @Since("2.4.0")
   def setDistanceMeasure(value: String): this.type = set(distanceMeasure, value)
 
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("3.0.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): BisectingKMeansModel = instrumented { instr =>
     transformSchema(dataset.schema, logging = true)
 
     val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-    val rdd = DatasetUtils.columnToOldVector(dataset, getFeaturesCol)
+    val w = if (isDefined(weightCol) && $(weightCol).nonEmpty) {
+      col($(weightCol)).cast(DoubleType)
+    } else {
+      lit(1.0)
+    }
+
+    val instances: RDD[(OldVector, Double)] = dataset
+      .select(DatasetUtils.columnToVector(dataset, getFeaturesCol), w).rdd.map {
+      case Row(point: Vector, weight: Double) => (OldVectors.fromML(point), weight)
+    }
     if (handlePersistence) {
 
 Review comment:
   The persistence strategy is inconsistent. I'd prefer to standardize it more than change it though. We want to avoid silently taking a big performance hit by not caching. Either that means warning the user or caching internal RDDs for the user.
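
One way to read the "cache internal RDDs for the user" option is a small wrapper that persists only when the caller has not already cached the input and releases the cache once training is done. The sketch below is an illustration under stated assumptions, not the code in this PR: `FakeRdd` is a hypothetical stand-in for an RDD's caching surface and `fitWithPersistence` is a hypothetical helper name.

```scala
object PersistenceSketch {
  // Hypothetical stand-in for an RDD's caching surface; not a Spark type.
  final class FakeRdd(var cached: Boolean) {
    def persist(): Unit = { cached = true }
    def unpersist(): Unit = { cached = false }
  }

  // Cache the internal data only when the caller has not cached the input,
  // and release it again once training finishes, even if training throws.
  def fitWithPersistence[R](callerCached: Boolean, instances: FakeRdd)(
      train: FakeRdd => R): R = {
    val handlePersistence = !callerCached
    if (handlePersistence) instances.persist()
    try train(instances)
    finally if (handlePersistence) instances.unpersist()
  }
}
```

The alternative mentioned in the comment, warning the user instead of caching, would replace the `persist()` call with a log message and leave the storage level untouched.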


[GitHub] [spark] huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r363904151
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,33 @@ class BisectingKMeans private (
   private[spark] def run(
       input: RDD[Vector],
       instr: Option[Instrumentation]): BisectingKMeansModel = {
+    val instances: RDD[(Vector, Double)] = input.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
+  }
+
+  private[spark] def runWithWeight(
+      input: RDD[(Vector, Double)],
+      instr: Option[Instrumentation]): BisectingKMeansModel = {
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
         + " its parent RDDs are also not cached.")
     }
-    val d = input.map(_.size).first()
+    val d = input.map(_._1.size).first
     logInfo(s"Feature dimension: $d.")
 
     val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
     // Compute and cache vector norms for fast distance computation.
-    val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK)
-    val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
+    val norms = input.map(d => Vectors.norm(d._1, 2.0))
+      .persist(StorageLevel.MEMORY_AND_DISK)
+    val vectors = input.zip(norms).map {
+      case ((x, weight), norm) => new VectorWithNorm(x, norm, weight)
+    }
 
 Review comment:
   OK. Updated. Thanks!


[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r361829940
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,34 @@ class BisectingKMeans private (
   private[spark] def run(
       input: RDD[Vector],
       instr: Option[Instrumentation]): BisectingKMeansModel = {
+    val instances: RDD[(Vector, Double)] = input.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
+  }
+
+  private[spark] def runWithWeight(
+      input: RDD[(Vector, Double)],
+      instr: Option[Instrumentation]): BisectingKMeansModel = {
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
         + " its parent RDDs are also not cached.")
     }
-    val d = input.map(_.size).first()
+    val d = input.map( i => i._1.size).first()
 
 Review comment:
   nit, remove space before first `i`


[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569468198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20688/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569464392
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571005548
 
 
   **[Test build #116128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116128/testReport)** for PR 27035 at commit [`ae7948b`](https://github.com/apache/spark/commit/ae7948b2d5feea8fd61e733302d0c04b11c05774).


[GitHub] [spark] srowen commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
srowen commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-573687526
 
 
   Merged to master


[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571756490
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116255/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571019098
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116128/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571728180
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569464394
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/115895/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-569727660
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20755/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#issuecomment-571019098
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116128/
   Test PASSed.


[GitHub] [spark] huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r362032890
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 ##########
 @@ -521,8 +521,10 @@ object KMeans {
 /**
  * A vector with its norm for fast distance computation.
  */
-private[clustering] class VectorWithNorm(val vector: Vector, val norm: Double)
-    extends Serializable {
+private[clustering] class VectorWithNorm(
 
 Review comment:
   I will update the KMeans code. Thanks!


[GitHub] [spark] huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r363904049
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,33 @@ class BisectingKMeans private (
   private[spark] def run(
 
 Review comment:
   Yes. It is called by this method:
   ```
     @Since("1.6.0")
     def run(input: RDD[Vector]): BisectingKMeansModel = {
       run(input, None)
     }
   ```


[GitHub] [spark] zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27035: [SPARK-30351][ML][PySpark] BisectingKMeans support instance weighting
URL: https://github.com/apache/spark/pull/27035#discussion_r363688765
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
 ##########
 @@ -156,20 +157,33 @@ class BisectingKMeans private (
   private[spark] def run(
       input: RDD[Vector],
       instr: Option[Instrumentation]): BisectingKMeansModel = {
+    val instances: RDD[(Vector, Double)] = input.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
+  }
+
+  private[spark] def runWithWeight(
+      input: RDD[(Vector, Double)],
+      instr: Option[Instrumentation]): BisectingKMeansModel = {
     if (input.getStorageLevel == StorageLevel.NONE) {
       logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
         + " its parent RDDs are also not cached.")
     }
-    val d = input.map(_.size).first()
+    val d = input.map(_._1.size).first
     logInfo(s"Feature dimension: $d.")
 
     val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
     // Compute and cache vector norms for fast distance computation.
-    val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK)
-    val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
+    val norms = input.map(d => Vectors.norm(d._1, 2.0))
+      .persist(StorageLevel.MEMORY_AND_DISK)
+    val vectors = input.zip(norms).map {
+      case ((x, weight), norm) => new VectorWithNorm(x, norm, weight)
+    }
 
 Review comment:
   Should the caching strategy follow https://github.com/apache/spark/pull/27052?
