You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/09 06:38:55 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

zhengruifeng opened a new pull request #35457:
URL: https://github.com/apache/spark/pull/35457


   ### What changes were proposed in this pull request?
   
   SPARK-31007 introduce an auxiliary statistics to speed up computation in KMeasn.
   
   However, it needs a array of size `k * (k + 1) / 2`, which may cause overflow or OOM when k is too large.
   
   So we should skip this optimization in this case.
   
   
   ### Why are the changes needed?
   
   avoid overflow or OOM when k is too large (like 50,000)
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   existing testsuites
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r809762604



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
##########
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
     packedValues
   }
 
+  def findClosest(
+      centers: Array[VectorWithNorm],
+      statistics: Option[Array[Double]],
+      point: VectorWithNorm): (Int, Double) = {
+    if (statistics.nonEmpty) {
+      findClosest(centers, statistics.get, point)
+    } else {
+      findClosest(centers, point)
+    }
+  }

Review comment:
       good point. I will update this PR




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1039768604


   Your current design is fine, I trust your judgment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034440217


   I think I made it too complex.
   
   
   
   according to @anders-rydbirk  your description in the ticket:
   
   ```
   Possible workaround:
   
       Roll back to Spark 3.0.0 since a KMeansModel generated with 3.0.0 cannot be loaded in 3.1.1.
       Reduce K. Currently trying with 45000.
   
   ```
   
   maybe we just need to chang `k * (k + 1) / 2` to `(k.toLong * (k + 1) / 2).toInt`?
   ```
   scala> val k = 50000
   val k: Int = 50000
   
   scala> k * (k + 1) / 2
   val res8: Int = -897458648
   
   scala> (k.toLong * (k + 1) / 2).toInt
   val res9: Int = 1250025000
   
   scala> val k = 45000
   val k: Int = 45000
   
   scala> k * (k + 1) / 2
   val res10: Int = 1012522500
   
   scala> (k.toLong * (k + 1) / 2).toInt
   val res11: Int = 1012522500
   ```
   
   
   
   > Sorry, I guess I mean make it into an array of arrays, not one big array.
   
   @srowen  yes, using arrays of sizes (1, 2, ..., k) is another choice


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] huaxingao commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

huaxingao commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1056130939


   I will merge this tomorrow if there are no further comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034447835


   there are two limits:
   
   1, array size, must be less than `Int.MaxValue`;
   
   2, its size should fit in memory for initialization and broadcasting.
   
   with --driver-memory=8G, I can not create an array of 1250025000 doubles. If we switch to arrays of arrays, I am afraid it's prone to OOM for large K.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r811642824



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
##########
@@ -269,15 +269,22 @@ class KMeans private (
 
     instr.foreach(_.logNumFeatures(numFeatures))
 
-    val shouldDistributed = centers.length * centers.length * numFeatures.toLong > 1000000L
+    val shouldComputeStats =
+      DistanceMeasure.shouldComputeStatistics(centers.length)
+    val shouldComputeStatsLocally =
+      DistanceMeasure.shouldComputeStatisticsLocally(centers.length, numFeatures)
 
     // Execute iterations of Lloyd's algorithm until converged
     while (iteration < maxIterations && !converged) {
       val bcCenters = sc.broadcast(centers)
-      val stats = if (shouldDistributed) {
-        distanceMeasureInstance.computeStatisticsDistributedly(sc, bcCenters)
+      val stats = if (shouldComputeStats) {

Review comment:
       So the following `val (bestCenter, cost) = distanceMeasureInstance.findClosest(centers, stats, point)` will call this new method, without code change in the call sites.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r809964477



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
##########
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
     packedValues
   }
 
+  def findClosest(

Review comment:
       OK this is a new method but I don't see it called, maybe I'm missing something




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034441433


   Array sizes can't be long so if it doesn't fit in an int it won't work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034401889


   cc @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r803249749



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
##########
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
     packedValues
   }
 
+  def findClosest(

Review comment:
       it is used in both training and prediction.  `statistics` is optional in it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034425044


   since the matrix is symmetric, if we un-pack it, then we will get a even bigger matrix of size `k * k`.
   
   https://github.com/apache/spark/pull/27758#discussion_r409316055


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1046398326


   Do existing call sites bind to the new method? I can't see how a new method is called when nothing new calls it, but if you understand it and it works, nevermind


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1044135981


   I think this should also be back-ported to 3.1/3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1046570349


   > Do existing call sites bind to the new method?
   
   NO.
   
   existing two methods are used in `DistanceMeasure` and `DistanceMeasureSuite`;
   
   but `def findClosest(centers: Array[VectorWithNorm], point: VectorWithNorm)` is also used in KMeans initialization algorithm `initKMeansParallel` and  `BisectingKMeans`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1057638742


   @huaxingao @srowen @dongjoon-hyun  Thanks for reviewing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] huaxingao commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

huaxingao commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1057320634


   Merged to master/3.2/3.1. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r808239477



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala
##########
@@ -52,7 +52,14 @@ class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],
   @transient private lazy val statistics = if (clusterCenters == null) {
     null
   } else {
-    distanceMeasureInstance.computeStatistics(clusterCentersWithNorm)
+    val k = clusterCenters.length
+    val numFeatures = clusterCenters.head.size
+    if (DistanceMeasure.shouldComputeStatistics(k) &&
+      DistanceMeasure.shouldComputeStatisticsLocally(k, numFeatures)) {

Review comment:
       indentation?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r803244226



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
##########
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
     packedValues
   }
 
+  def findClosest(

Review comment:
       Is this overload used?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034429416


   Sorry, I guess I mean make it into an array of arrays, not one big array.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1039768604


   Your current design is fine, I trust your judgment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r811642393



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
##########
@@ -269,15 +269,22 @@ class KMeans private (
 
     instr.foreach(_.logNumFeatures(numFeatures))
 
-    val shouldDistributed = centers.length * centers.length * numFeatures.toLong > 1000000L
+    val shouldComputeStats =
+      DistanceMeasure.shouldComputeStatistics(centers.length)
+    val shouldComputeStatsLocally =
+      DistanceMeasure.shouldComputeStatisticsLocally(centers.length, numFeatures)
 
     // Execute iterations of Lloyd's algorithm until converged
     while (iteration < maxIterations && !converged) {
       val bcCenters = sc.broadcast(centers)
-      val stats = if (shouldDistributed) {
-        distanceMeasureInstance.computeStatisticsDistributedly(sc, bcCenters)
+      val stats = if (shouldComputeStats) {

Review comment:
       Now, it is a Option[Array[Double]]




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r811642211



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
##########
@@ -269,15 +269,22 @@ class KMeans private (
 
     instr.foreach(_.logNumFeatures(numFeatures))
 
-    val shouldDistributed = centers.length * centers.length * numFeatures.toLong > 1000000L
+    val shouldComputeStats =
+      DistanceMeasure.shouldComputeStatistics(centers.length)
+    val shouldComputeStatsLocally =
+      DistanceMeasure.shouldComputeStatisticsLocally(centers.length, numFeatures)
 
     // Execute iterations of Lloyd's algorithm until converged
     while (iteration < maxIterations && !converged) {
       val bcCenters = sc.broadcast(centers)
-      val stats = if (shouldDistributed) {

Review comment:
       previous stats is a Array[Double]




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1034425044


   since the matrix is symmetric, if we un-pack it, then we will get a even bigger matrix of size `k * k`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] anders-rydbirk commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

anders-rydbirk commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1033684294


   @zhengruifeng Thanks for picking this one up!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35457:
URL: https://github.com/apache/spark/pull/35457#discussion_r808236586



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
##########
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
     packedValues
   }
 
+  def findClosest(
+      centers: Array[VectorWithNorm],
+      statistics: Option[Array[Double]],
+      point: VectorWithNorm): (Int, Double) = {
+    if (statistics.nonEmpty) {
+      findClosest(centers, statistics.get, point)
+    } else {
+      findClosest(centers, point)
+    }
+  }

Review comment:
       Shall we add the function description like the other existing `def findClosest` functions?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1039758803


   @srowen 
   
   I can switch to array[array[double] if you perfer it, I am netural on it.
   
   my main concern is, this optional statistics may be too large. In this case, k=50,000, it is much larger than the clustering centers (dim=3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1039758803


   @srowen 
   
   I can switch to array[array[double] if you perfer it, I am netural on it.
   
   my main concern is, this optional statistics may be too large. In this case, k=50,000, it is much larger than the clustering centers (dim=3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #35457:
URL: https://github.com/apache/spark/pull/35457#issuecomment-1046373579


   ![image](https://user-images.githubusercontent.com/7322292/154873090-857a0b9b-43c0-4120-a0a4-2b66bfabe510.png)
   
   @srowen It is used in both training (in the .ml side) and prediction (in the .mllib side), the switch is done by just changing the type of `stats` in `distanceMeasureInstance.findClosest(centers, stats, point)` from `Array[Double]` to `Option[Array[Double]]`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] huaxingao closed pull request #35457: [SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K

Posted by GitBox <gi...@apache.org>.

huaxingao closed pull request #35457:
URL: https://github.com/apache/spark/pull/35457


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org