You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/02 10:05:40 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

zhengruifeng opened a new pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758
 
 
   ### What changes were proposed in this pull request?
   apply Lemma 1 in [Using the Triangle Inequality to Accelerate K-Means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf):
    
   > Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
   
   It can be directly applied in EuclideanDistance, but not in CosineDistance.
   However, luckily for CosineDistance we can get a variant in the space of radian/angle.
   
   
   ### Why are the changes needed?
   It help improving the performance of prediction
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   existing testsuites

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323571
 
 
   In current impl, following Lemma is used in KMeans:
   
   0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))| = |norm-norm(c)|
   ```scala
         val center = centers(i)
         // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
         // distance computation.
         var lowerBoundOfSqDist = center.norm - point.norm
         lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
         if (lowerBoundOfSqDist < bestDistance) {
           ...
         }
   ```
   this can only be applied in `EuclideanDistance`, but not in `CosineDistance`
   
   
   
   According to [Using the Triangle Inequality to Accelerate K-Meanswe](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) , we can go futher, and there are another two Lemmas can be used:
   
   1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
   
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   However, luckily for `CosineDistance` we can get a variant in the space of radian/angle.
   
   2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b)-d(b,c)};
   
   this can be applied in EuclideanDistance, but not in CosineDistance
   
   The application of Lemma 2 is a little complex: It need to cache/update the distance/lower bounds to previous centers, and thus can be only applied in training, not usable in predction.
   
   So this PR is mainly for Lemma 1. Its idea is quite simple, if point x is close to center b enough (less than a pre-computed radius), then we can say point x belong to center c without computing the distances between x and other centers. It can be used in both training and predction.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593359296
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119166/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r408157153
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -17,23 +17,124 @@
 
 package org.apache.spark.mllib.clustering
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
+import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.mllib.linalg.BLAS.{axpy, dot, scal}
 import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   * @param distance distance between two centers
+   */
+  def computeStatistics(distance: Double): Double
+
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   *
+   * @return A symmetric matrix containing statistics, matrix(i)(j) represents:
 
 Review comment:
   It might be too hard to implement, but if it's symmetric you only need half this many elements to represent. The indexing becomes more complex though

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593811985
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23949/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386456920
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -154,22 +198,86 @@ object DistanceMeasure {
 }
 
 private[spark] class EuclideanDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Euclidean distance, radius of center c is half of the distance between
+   *         center c and its closest center.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      distances.map(_ / 2)
+    }
+  }
+
+  /**
+   * @return the index of the closest center to the given point, as well as the cost.
+   */
+  override def findClosest(
+      centers: Array[VectorWithNorm],
+      radii: Array[Double],
+      point: VectorWithNorm): (Int, Double) = {
+    var bestDistance = Double.PositiveInfinity
+    var bestIndex = 0
+    var i = 0
+    var found = false
+    while (i < centers.length && !found) {
+      val center = centers(i)
+      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
+      // distance computation.
+      var lowerBoundOfSqDist = center.norm - point.norm
+      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
+      if (lowerBoundOfSqDist < bestDistance) {
+        val d = EuclideanDistanceMeasure.fastSquaredDistance(center, point)
+        val r = radii(i)
+        if (d < r * r) {
+          bestDistance = d
+          bestIndex = i
+          found = true
 
 Review comment:
   Same, just return?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r387140942
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   OK, the last expression comes from the cos(2x) identity, OK.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r408157550
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -17,23 +17,124 @@
 
 package org.apache.spark.mllib.clustering
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
+import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.mllib.linalg.BLAS.{axpy, dot, scal}
 import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   * @param distance distance between two centers
+   */
+  def computeStatistics(distance: Double): Double
+
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   *
+   * @return A symmetric matrix containing statistics, matrix(i)(j) represents:
+   *         1, a lower bound r of the center i, if i==j. If distance between point x and center i
+   *         is less than f(r), then center i is the closest center to point x.
+   *         2, a lower bound r=matrix(i)(j) to help avoiding unnecessary distance computation.
+   *         Given point x, let i be current closest center, and d be current best distance,
+   *         if d < f(r), then we no longer need to compute the distance to center j.
+   */
+  def computeStatistics(centers: Array[VectorWithNorm]): Array[Array[Double]] = {
+    val k = centers.length
+    if (k == 1) return Array(Array(Double.NaN))
+
+    val stats = Array.ofDim[Double](k, k)
+    var i = 0
+    while (i < k) {
+      stats(i)(i) = Double.PositiveInfinity
+      i += 1
+    }
+    i = 0
+    while (i < k) {
+      var j = i + 1
+      while (j < k) {
+        val d = distance(centers(i), centers(j))
+        val s = computeStatistics(d)
+        stats(i)(j) = s
 
 Review comment:
   If you're micro-optimizing, I suppose you can lift stats(i) out of the loop, but it may not o anything

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613446379
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593811985
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23949/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613407209
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25962/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615111279
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/26086/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323571
 
 
   In current impl, following Lemma is used in KMeans:
   
   0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))| = |norm-norm(c)|
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`
   
   According to [Using the Triangle Inequality to Accelerate K-Meanswe](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) can go futher, and there are another two Lemmas can be used:
   
   1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
   
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   However, luckily for `CosineDistance` we can get a variant in the space of radian/angle.
   
   2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b)-d(b,c)};
   
   this can be applied in EuclideanDistance, but not in CosineDistance
   
   The application of Lemma 2 is a little complex: It need to cache/update the distance/lower bounds to previous centers, and thus can be only applied in training, not usable in predction.
   
   So this PR is mainly for Lemma 1. Its idea is quite simple, if point x is close to center b enough (less than a pre-computed radius), then we can say point x belong to center c without computing the distances between x and other centers. It can be used in both training and predction.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-594055650
 
 
   If I'm reading that right, it's slower to train in many cases in the first instance, and slightly faster at scale in the second. I'm surprised because I'd have thought it's more easily a win, even with the overhead of calculating stats. Is the purpose more about prediction speed? I just wonder how much one is worth vs the other.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615101070
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386812970
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   In short, cos_distance do not obey triangle inequality, so we can **NOT** say:
   If cos_distance(b,x) < cos_distance(b,c)/2, then cos_distance(b,x) < cos_distance(c,x)
   
   
   However, the arc distance (or angle) obeys `Each side of a spherical triangle is less than the sum of the other two.`, so we can get a angular bound, and then a cos_distance bound:
   if point cos_distance(b,x) < 1 - sqrt{ 1 - cos_distance(b,c) / 2 }, then cos_distance(b,x) < cos_distance(c,x) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615111266
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613406732
 
 
   **[Test build #121275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121275/testReport)** for PR 27758 at commit [`b4fabb1`](https://github.com/apache/spark/commit/b4fabb1ddb9c2c1a92a0ad5c9ccbbaed4f3d0acc).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615110579
 
 
   **[Test build #121401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121401/testReport)** for PR 27758 at commit [`0afc845`](https://github.com/apache/spark/commit/0afc845222a0c63d74474252b05185e3be402744).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550
 
 
   env: bin/spark-shell --driver-memory=64G
   
   testCode:
   ```scala
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
   df.persist()
   
   val km = new KMeans().setK(16).setMaxIter(5)
   val kmm = km.fit(df)
   
   val results = Seq(2,4,8,16).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df).count; // trigger the lazy computation of radii
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result
   }
   
   
   val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df2.persist()
   
   val km = new KMeans().setK(16).setMaxIter(10)
   val kmm = km.fit(df2)
   
   val results2 = Seq(2,4,8,16,32,64,128,256).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df2);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df2).count; // trigger the lazy computation of radii
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result }
   ```
   
   1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|27.602|47.932|145.430|371.239|27.042|46.630|136.205|350.788|
   |Radii Computation (sec)|0.097|0.651|5.558|26.059|-|-|-|-|
   |NumIters|9|10|18|20|9|10|18|20|
   |Cost|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340055|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340064|
   |Prediction Duration (millsec)|31|31|30|30|36|33|41|43|
   
   
   2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|0.465|1.411|0.957|1.109|1.387|2.137|3.891|8.373|0.484|0.758|1.065|1.3|1.577|2.413|4.483|9.616|
   |Radii Computation (sec)|0.000|0.000|0.000|0.001|0.002|0.007|0.024|0.093|-|-|-|-|-|-|-|-|
   |NumIters|5|14|20|20|20|20|20|20|5|14|20|20|20|20|20|20|
   |Cost|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|
   |Prediction Duration (millsec)|32|33|31|30|30|30|30|31|39|36|38|33|32|29|35|31|
   
   
   We can see that:
   1, the convergence of both impl are almost the same.
   2, new impl of prediction is likely to be faster than existing impl (after lazy computation of radii);
   3, computation of radii may take a few seconds, so training maybe slower than existing impl in some case (few instances, large k, large dim, then the computation of radii matters in the whole training), for example, webspam with k=16, new impl took 371.239 sec, while existing impl took 350.788 sec. That is because the computation of radii (in all 20 iterations) totally took 26.059 sec.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615110579
 
 
   **[Test build #121401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121401/testReport)** for PR 27758 at commit [`0afc845`](https://github.com/apache/spark/commit/0afc845222a0c63d74474252b05185e3be402744).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613407198
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386785281
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -24,16 +24,60 @@ import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Radii of centers used in triangle inequality to obtain useful bounds to find
+   * closest centers.
+   *
+   * @see <a href="https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf">Charles Elkan,
+   *      Using the Triangle Inequality to Accelerate k-Means</a>
+   *
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   */
+  def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
 
 Review comment:
   You are right, I should remove this. Not all distance can be expected to have such triangle-inequality.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386812970
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   In short, although cos_distance do not obey triangle inequality. The arc distance (or angle) obeys `Each side of a spherical triangle is less than the sum of the other two.`, then we can get a angular bound, and then a cos_distance bound.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613446389
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/121275/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323265
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550
 
 
   env: bin/spark-shell --driver-memory=64G
   
   testCode:
   ```scala
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
   df.persist()
   
   val km = new KMeans().setK(16).setMaxIter(5)
   val kmm = km.fit(df)
   
   val results = Seq(2,4,8,16).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df).count; // trigger the lazy computation of radii
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result
   }
   
   
   val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df2.persist()
   
   val km = new KMeans().setK(16).setMaxIter(10)
   val kmm = km.fit(df2)
   
   val results2 = Seq(2,4,8,16,32,64,128,256).map { k => val km = new KMeans().setK(k).setMaxIter(20); val start = System.currentTimeMillis; val kmm = km.fit(df2); val end = System.currentTimeMillis; val train = end - start;  kmm.transform(df2).count; val start2 = System.currentTimeMillis; kmm.transform(df).count; val end2 = System.currentTimeMillis; val predict = end2 - start2; val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict); println(result); result }
   ```
   
   1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|27.602|47.932|145.430|371.239|27.042|46.630|136.205|350.788|
   |Radii Computation (sec)|0.097|0.651|5.558|26.059|-|-|-|-|
   |NumIters|9|10|18|20|9|10|18|20|
   |Cost|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340055|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340064|
   |Prediction Duration (millsec)|31|31|30|30|36|33|41|43|
   
   
   2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|0.465|1.411|0.957|1.109|1.387|2.137|3.891|8.373|0.484|0.758|1.065|1.3|1.577|2.413|4.483|9.616|
   |Radii Computation (sec)|0.000|0.000|0.000|0.001|0.002|0.007|0.024|0.093|-|-|-|-|-|-|-|-|
   |NumIters|5|14|20|20|20|20|20|20|5|14|20|20|20|20|20|20|
   |Cost|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|
   |Prediction Duration (millsec)|32|33|31|30|30|30|30|31|39|36|38|33|32|29|35|31|
   
   conclusion:
   1, the convergence of both impl are almost the same.
   2, new impl of prediction is likely to be faster than existing impl (after lazy computation of radii);
   3, computation of radii may take a few seconds, so training maybe slower than existing impl in some case (few instances, large k, large dim, then the computation of radii matters in the whole training), for example, webspam with k=16, new impl took 371.239 sec, while existing impl took 350.788 sec. That is because the computation of radii (in all 20 iterations) totally took 26.059 sec.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615111266
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593358675
 
 
   **[Test build #119166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119166/testReport)** for PR 27758 at commit [`c423b74`](https://github.com/apache/spark/commit/c423b74eaf72c40fcaa0cc823b600ae45299d245).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615144257
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615100525
 
 
   **[Test build #121400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121400/testReport)** for PR 27758 at commit [`2b10eb2`](https://github.com/apache/spark/commit/2b10eb27119de3f35529ef51b34542b7b3a01daf).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613425865
 
 
   I made a update to optimize the computation of statistics, if `k` and/or `numFeatures` are too large, compute the statistics distributedly.
   
   I retest this impl today, and I use SparkUI to profile the performance:
   testcode:
   ```scala
   
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   var df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k").repartition(2)
   df.persist()
   
   (0 until 4).foreach{ _ => df = df.union(df) }
   df.count
   
   Seq(4,8,16,32).foreach{ k => new KMeans().setK(k).setMaxIter(5).fit(df) }
   ```
   
   I recoded both the duration at each iteration and the _Stage_ of prediction:
   ![image](https://user-images.githubusercontent.com/7322292/79227674-d6a9d200-7e92-11ea-882f-f701bd24cbb0.png)
   
   
   
   
   results:
   
   Test on webspam | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | Master(k=4) | Master(k=8) | Master(k=16) | Master(k=32)
   -- | -- | -- | -- | -- | -- | -- | -- | --
   Average iteration (sec) | 9.2+0.0 | 15.8+0.1 | 31.4+0.5 | 63.6+2 | 9.8 | 16.4 | 34.6 | 78.3
   Average Prediction Stage | 6 | 10.1 | 20.6 | 44.4 | 6 | 10.8 | 22.8 | 57.2
   
   `63.6+2` here means it took 2sec to compute those statistics distributedly, which is faster than the previous commit (computing statstics in the driver) which took about 9sec.
   ![image](https://user-images.githubusercontent.com/7322292/79227308-453a6000-7e92-11ea-8f06-8841266beb6e.png)
   
   
   When `k=4,8` the speedup is not significant, when `k=16,32` it is about 10%~30% faster in prediction Stage.
   
   It shows that the large `k` is, the realtively faster this new impl is, that is because for large `k` there is more chances to trigger the short circuits.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615144257
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593819492
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119208/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593811976
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613425865
 
 
   I made a update to optimize the computation of statistics, if `k` and/or `numFeatures` are
   too large, compute the statistics distributedly.
   
   I retest this impl today, and I use SparkUI to profile the performance:
   testcode:
   ```scala
   
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   var df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k").repartition(2)
   df.persist()
   
   (0 until 4).foreach{ _ => df = df.union(df) }
   df.count
   
   Seq(4,8,16,32).foreach{ k => new KMeans().setK(k).setMaxIter(5).fit(df) }
   ```
   
   I recoded both the duration at each iteration and the _Stage_ of prediction:
   ![image-20200414200611843](/home/zrf/.config/Typora/typora-user-images/image-20200414200611843.png)
   
   results:
   
   Test on webspam | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | Master(k=4) | Master(k=8) | Master(k=16) | Master(k=32)
   -- | -- | -- | -- | -- | -- | -- | -- | --
   Average iteration (sec) | 9.2+0.0 | 15.8+0.1 | 31.4+0.5 | 63.6+2 | 9.8 | 16.4 | 34.6 | 78.3
   Average Prediction Stage | 6 | 10.1 | 20.6 | 44.4 | 6 | 10.8 | 22.8 | 57.2
   
   `63.6+2` here means it took 2sec to compute those statistics distributedly, which is faster than the previous commit (computing statstics in the driver) which took about 9sec.
   ![image](https://user-images.githubusercontent.com/7322292/79227308-453a6000-7e92-11ea-8f06-8841266beb6e.png)
   
   
   When `k=4,8` the speedup is not significant, when `k=16,32` it is about 10%~30% faster in prediction Stage.
   
   It shows that the large `k` is, the realtively faster this new impl is, that is because for large `k` there is more chances to trigger the short circuits.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613445548
 
 
   **[Test build #121275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121275/testReport)** for PR 27758 at commit [`b4fabb1`](https://github.com/apache/spark/commit/b4fabb1ddb9c2c1a92a0ad5c9ccbbaed4f3d0acc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386809482
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   Yes, Cosine distance doesn't obey the triangle inequality, but the following lemma should be available to apply:
   
   given a point x, and let b and c be centers. If angle(x, b)<angle(b,c)/2, then angle(x,b)<angle(x,c), cos_distance(x,b)=1-cos(x,b)<cos_distance(x,c)=1-cos(x,c)
   
   That is because: [PRINCIPLES FROM GEOMETRY](http://www.angelfire.com/nt/navtrig/B1.html)
   
   > Each side of a spherical triangle is less than the sum of the other two.
   
   angle(x,b) + angle(x,c) > angle(b,c)
   angle(x,b) < angle(b,c)/2
   
   => angle(x,c) > angle(b,c)/2 > angle(x,b)
   => cos_distance(x,c) > cos_distance(x,b)
   
   
    angle(x,b) < angle(b,c)/2
   <=>  cos(x,b) > sqrt{ (cos(b,c) + 1)/2 }
   <=>  cos_distance(x,b) < 1 - sqrt{ (cos(b,c) + 1)/2 } = 1 - sqrt{ 1 - cos_distance(b,c) / 2  }
   
   => Give two centers b and c, if point x has cos_distance(x,b) < 1 - sqrt{ 1 - cos_distance(b,c) / 2  }, then point x belongs to center b.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386809482
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   Yes, Cosine distance doesn't obey the triangle inequality, but the following lemma should be available to apply:
   
   given a point x, and let b and c be centers. If angle(x, b)<angle(b,c)/2, then angle(x,b)<angle(x,c), cos_distance(x,b)=1-cos(x,b)<cos_distance(x,c)=1-cos(x,c)
   
   That is because: [PRINCIPLES FROM GEOMETRY](http://www.angelfire.com/nt/navtrig/B1.html)
   
   > Each side of a spherical triangle is less than the sum of the other two.
   
   angle(x,b) + angle(x,c) > angle(b,c)
   angle(x,b) < angle(b,c)/2
   
   => angle(x,c) > angle(b,c)/2 > angle(x,b)
   => cos_distance(x,c) > cos_distance(x,b)
   
   
    

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593811584
 
 
   **[Test build #119208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119208/testReport)** for PR 27758 at commit [`dd2aff7`](https://github.com/apache/spark/commit/dd2aff729a8ff4ffea34691486337f0bc3b5f16b).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615100525
 
 
   **[Test build #121400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121400/testReport)** for PR 27758 at commit [`2b10eb2`](https://github.com/apache/spark/commit/2b10eb27119de3f35529ef51b34542b7b3a01daf).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323265
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615132707
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/121400/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593811976
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613425865
 
 
   I made a update to optimize the computation of statistics, if `k` and/or `numFeatures` are too large, compute the statistics distributedly.
   
   I retest this impl today, and I use SparkUI to profile the performance:
   testcode:
   ```scala
   
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   var df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k").repartition(2)
   df.persist()
   
   (0 until 4).foreach{ _ => df = df.union(df) }
   df.count
   
   Seq(4,8,16,32).foreach{ k => new KMeans().setK(k).setMaxIter(5).fit(df) }
   ```
   
   I recoded both the duration at each iteration and the _Stage_ of prediction:
   ![image](https://user-images.githubusercontent.com/7322292/79227534-9f3b2580-7e92-11ea-9d08-94bc2d769c16.png)
   
   
   
   
   results:
   
   Test on webspam | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | Master(k=4) | Master(k=8) | Master(k=16) | Master(k=32)
   -- | -- | -- | -- | -- | -- | -- | -- | --
   Average iteration (sec) | 9.2+0.0 | 15.8+0.1 | 31.4+0.5 | 63.6+2 | 9.8 | 16.4 | 34.6 | 78.3
   Average Prediction Stage | 6 | 10.1 | 20.6 | 44.4 | 6 | 10.8 | 22.8 | 57.2
   
   `63.6+2` here means it took 2sec to compute those statistics distributedly, which is faster than the previous commit (computing statstics in the driver) which took about 9sec.
   ![image](https://user-images.githubusercontent.com/7322292/79227308-453a6000-7e92-11ea-8f06-8841266beb6e.png)
   
   
   When `k=4,8` the speedup is not significant, when `k=16,32` it is about 10%~30% faster in prediction Stage.
   
   It shows that the large `k` is, the realtively faster this new impl is, that is because for large `k` there is more chances to trigger the short circuits.
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550
 
 
   env: bin/spark-shell --driver-memory=64G
   
   testCode:
   ```scala
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
   df.persist()
   
   val km = new KMeans().setK(16).setMaxIter(5)
   val kmm = km.fit(df)
   
   val results = Seq(2,4,8,16).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df).count; // trigger the lazy computation of radii
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result
   }
   
   
   val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df2.persist()
   
   val km = new KMeans().setK(16).setMaxIter(10)
   val kmm = km.fit(df2)
   
   val results2 = Seq(2,4,8,16,32,64,128,256).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df2);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df2).count;
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result }
   ```
   
   1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|27.602|47.932|145.430|371.239|27.042|46.630|136.205|350.788|
   |Radii Computation (sec)|0.097|0.651|5.558|26.059|-|-|-|-|
   |NumIters|9|10|18|20|9|10|18|20|
   |Cost|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340055|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340064|
   |Prediction Duration (millsec)|31|31|30|30|36|33|41|43|
   
   
   2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|0.465|1.411|0.957|1.109|1.387|2.137|3.891|8.373|0.484|0.758|1.065|1.3|1.577|2.413|4.483|9.616|
   |Radii Computation (sec)|0.000|0.000|0.000|0.001|0.002|0.007|0.024|0.093|-|-|-|-|-|-|-|-|
   |NumIters|5|14|20|20|20|20|20|20|5|14|20|20|20|20|20|20|
   |Cost|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|
   |Prediction Duration (millsec)|32|33|31|30|30|30|30|31|39|36|38|33|32|29|35|31|
   
   conclusion:
   1, the convergence of both impl are almost the same.
   2, new impl of prediction is likely to be faster than existing impl (after lazy computation of radii);
   3, computation of radii may take a few seconds, so training maybe slower than existing impl in some case (few instances, large k, large dim, then the computation of radii matters in the whole training), for example, webspam with k=16, new impl took 371.239 sec, while existing impl took 350.788 sec. That is because the computation of radii (in all 20 iterations) totally took 26.059 sec.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613446389
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/121275/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615111279
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/26086/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323571
 
 
   In current impl, following Lemma is used in KMeans:
   
   0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))| = |norm-norm(c)|
   ```scala
         val center = centers(i)
         // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
         // distance computation.
         var lowerBoundOfSqDist = center.norm - point.norm
         lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
         if (lowerBoundOfSqDist < bestDistance) {
           ...
         }
   ```
   this can only be applied in `EuclideanDistance`, but not in `CosineDistance`
   
   
   
   According to [Using the Triangle Inequality to Accelerate K-Meanswe](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) , we can go futher, and there are another two Lemmas can be used:
   
   1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
   
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   However, luckily for `CosineDistance` we can get a variant in the space of radian/angle.
   
   2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b)-d(b,c)};
   
   this can be applied in EuclideanDistance, but not in CosineDistance
   
   The application of Lemma 2 is a little complex: It need to cache/update the distance/lower bounds to previous centers, and thus can be only applied in training, not usable in prediction.
   
   So this PR is mainly for Lemma 1. Its idea is quite simple, if point x is close to center b enough (less than a pre-computed radius), then we can say point x belong to center c without computing the distances between x and other centers. It can be used in both training and predction.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613446379
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613407198
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323571
 
 
   In current impl, following Lemma is used in KMeans:
   
   0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))| = |norm-norm(c)|
   ```scala
         val center = centers(i)
         // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
         // distance computation.
         var lowerBoundOfSqDist = center.norm - point.norm
         lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
         if (lowerBoundOfSqDist < bestDistance) {
           ...
         }
   ```
   this can only be applied in `EuclideanDistance`, but not in `CosineDistance`
   
   
   
   According to [Using the Triangle Inequality to Accelerate K-Meanswe](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) , we can go futher, and there are another two Lemmas can be used:
   
   > 1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
   
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   However,  for `CosineDistance` we can luckily get a variant in the space of radian/angle.
   
   This PR is mainly for Lemma 1. Its idea is quite simple, if point x is close to center b enough (less than a pre-computed radius), then we can say point x belong to center c without computing the distances between x and other centers. It can be used in both training and predction.
   
   
   > 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b)-d(b,c)};
   
   this can be applied in EuclideanDistance, but not in CosineDistance
   
   The application of Lemma 2 is a little complex:
   
   - It need to cache/update the distance/lower bounds to previous centers, 
   - and thus can be only applied in training, not usable in prediction.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550
 
 
   env: bin/spark-shell --driver-memory=64G
   
   testCode:
   ```scala
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
   df.persist()
   
   val km = new KMeans().setK(16).setMaxIter(5)
   val kmm = km.fit(df)
   
   val results = Seq(2,4,8,16).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df).count; // trigger the lazy computation of radii
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result
   }
   
   
   val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df2.persist()
   
   val km = new KMeans().setK(16).setMaxIter(10)
   val kmm = km.fit(df2)
   
   val results2 = Seq(2,4,8,16,32,64,128,256).map { k =>
       val km = new KMeans().setK(k).setMaxIter(20);
       val start = System.currentTimeMillis;
       val kmm = km.fit(df2);
       val end = System.currentTimeMillis;
       val train = end - start;
       kmm.transform(df2).count; // trigger the lazy computation of radii
       val start2 = System.currentTimeMillis;
       kmm.transform(df).count;
       val end2 = System.currentTimeMillis;
       val predict = end2 - start2;
       val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
       println(result);
       result }
   ```
   
   1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|27.602|47.932|145.430|371.239|27.042|46.630|136.205|350.788|
   |Radii Computation (sec)|0.097|0.651|5.558|26.059|-|-|-|-|
   |NumIters|9|10|18|20|9|10|18|20|
   |Cost|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340055|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340064|
   |Prediction Duration (millsec)|31|31|30|30|36|33|41|43|
   
   
   2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|0.465|1.411|0.957|1.109|1.387|2.137|3.891|8.373|0.484|0.758|1.065|1.3|1.577|2.413|4.483|9.616|
   |Radii Computation (sec)|0.000|0.000|0.000|0.001|0.002|0.007|0.024|0.093|-|-|-|-|-|-|-|-|
   |NumIters|5|14|20|20|20|20|20|20|5|14|20|20|20|20|20|20|
   |Cost|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|
   |Prediction Duration (millsec)|32|33|31|30|30|30|30|31|39|36|38|33|32|29|35|31|
   
   conclusion:
   1, the convergence of both impl are almost the same.
   2, new impl of prediction is likely to be faster than existing impl (after lazy computation of radii);
   3, computation of radii may take a few seconds, so training maybe slower than existing impl in some case (few instances, large k, large dim, then the computation of radii matters in the whole training), for example, webspam with k=16, new impl took 371.239 sec, while existing impl took 350.788 sec. That is because the computation of radii (in all 20 iterations) totally took 26.059 sec.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615101088
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/26083/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386812970
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   In short, cos_distance do not obey triangle inequality, so we can **NOT** say:
   If cos_distance(b,x) < cos_distance(b,c)/2, then cos_distance(b,x) < cos_distance(c,x)
   
   
   However, the arc distance (or angle) obeys `Each side of a spherical triangle is less than the sum of the other two.`, then we can get a angular bound, and then a cos_distance bound:
   if point cos_distance(b,x) < 1 - sqrt{ 1 - cos_distance(b,c) / 2 }, then cos_distance(b,x) < cos_distance(c,x) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615143731
 
 
   **[Test build #121401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121401/testReport)** for PR 27758 at commit [`0afc845`](https://github.com/apache/spark/commit/0afc845222a0c63d74474252b05185e3be402744).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323271
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23907/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-601683609

@srowen Sorry to reply late. I missed the emails from github.

> Is the purpose more about prediction speed?

It also help saving training time, if the dataset is large enough. Since the cost of computing stats is about O(k^2 * m), while the cost of computing distances at one iteration is O(k * n * m) where m is the number of features, and n is the number of instances; I guess I can compute the stats distributedly in some case (when k is large);

I just mark this PR WIP for two reasons:
1, I will test this impl on a big dataset distributedly to check wheter above hypothesis set up;
2, for cosine distance, I want to future find a theoretical basis for the bound. Since above `Each side of a spherical triangle is less than the sum of the other two` seems is for 3-dim. I think it also right when dim>3, but it is not used in other impls. I will look for a theoretical proof.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323571
 
 
   In current impl, following Lemma is used in KMeans:
   
   0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= |(d(x,o) - d(c,o))| = |norm-norm(c)|
   ```scala
         val center = centers(i)
         // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
         // distance computation.
         var lowerBoundOfSqDist = center.norm - point.norm
         lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
         if (lowerBoundOfSqDist < bestDistance) {
           ...
         }
   ```
   this can only be applied in `EuclideanDistance`, but not in `CosineDistance`
   
   
   
   According to [Using the Triangle Inequality to Accelerate K-Meanswe](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) can go futher, and there are another two Lemmas can be used:
   
   1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
   
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   However, luckily for `CosineDistance` we can get a variant in the space of radian/angle.
   
   2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b)-d(b,c)};
   
   this can be applied in EuclideanDistance, but not in CosineDistance
   
   The application of Lemma 2 is a little complex: It need to cache/update the distance/lower bounds to previous centers, and thus can be only applied in training, not usable in predction.
   
   So this PR is mainly for Lemma 1. Its idea is quite simple, if point x is close to center b enough (less than a pre-computed radius), then we can say point x belong to center c without computing the distances between x and other centers. It can be used in both training and predction.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593359289
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386461449
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   Cosine distance doesn't obey the triangle inequality; is this meant to be angular distance? For my benefit (and possibly comments) how is r the angular distance here?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593326104
 
 
   **[Test build #119166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119166/testReport)** for PR 27758 at commit [`c423b74`](https://github.com/apache/spark/commit/c423b74eaf72c40fcaa0cc823b600ae45299d245).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615132682
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615132208
 
 
   **[Test build #121400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121400/testReport)** for PR 27758 at commit [`2b10eb2`](https://github.com/apache/spark/commit/2b10eb27119de3f35529ef51b34542b7b3a01daf).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386452407
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -24,16 +24,60 @@ import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Radii of centers used in triangle inequality to obtain useful bounds to find
+   * closest centers.
+   *
+   * @see <a href="https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf">Charles Elkan,
+   *      Using the Triangle Inequality to Accelerate k-Means</a>
+   *
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   */
+  def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    Array.fill(k)(Double.NaN)
+  }
+
   /**
    * @return the index of the closest center to the given point, as well as the cost.
    */
   def findClosest(
-      centers: TraversableOnce[VectorWithNorm],
+      centers: Array[VectorWithNorm],
+      radii: Array[Double],
       point: VectorWithNorm): (Int, Double) = {
     var bestDistance = Double.PositiveInfinity
     var bestIndex = 0
     var i = 0
-    centers.foreach { center =>
+    var found = false
+    while (i < centers.length && !found) {
+      val center = centers(i)
+      val d = distance(center, point)
+      val r = radii(i)
+      if (d < r) {
+        bestDistance = d
+        bestIndex = i
+        found = true
 
 Review comment:
   Just return here rather than carry a flag around?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593359296
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119166/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615101088
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/26083/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593819403
 
 
   **[Test build #119208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119208/testReport)** for PR 27758 at commit [`dd2aff7`](https://github.com/apache/spark/commit/dd2aff729a8ff4ffea34691486337f0bc3b5f16b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386809482
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   Yes, Cosine distance doesn't obey the triangle inequality, but the following lemma should be available to apply:
   
   given a point x, and let b and c be centers. If angle(x, b)<angle(b,c)/2, then angle(x,b)<angle(x,c), cos_distance(x,b)=1-cos(x,b)<cos_distance(x,c)=1-cos(x,c)
   
   That is because: [PRINCIPLES FROM GEOMETRY, point 3](http://www.angelfire.com/nt/navtrig/B1.html)
   
   > Each side of a spherical triangle is less than the sum of the other two.
   
   angle(x,b) + angle(x,c) > angle(b,c)
   angle(x,b) < angle(b,c)/2
   
   => angle(x,c) > angle(b,c)/2 > angle(x,b)
   => cos_distance(x,c) > cos_distance(x,b)
   
   
    angle(x,b) < angle(b,c)/2
   <=>  cos(x,b) > sqrt{ (cos(b,c) + 1)/2 }
   <=>  cos_distance(x,b) < 1 - sqrt{ (cos(b,c) + 1)/2 } = 1 - sqrt{ 1 - cos_distance(b,c) / 2  }
   
   => Give two centers b and c, if point x has cos_distance(x,b) < 1 - sqrt{ 1 - cos_distance(b,c) / 2  }, then point x belongs to center b.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593819477
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615132682
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323271
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23907/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593359289
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615144264
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/121401/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593819492
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119208/
   Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-602227283
 
 
   PS it might be worth benchmarking again anyway, as I just merged the change that uses native BLAS for level 1 operations of vectors >= 256 elements.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r408158947
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +389,57 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   *
+   * @return One element used in statistics matrix to make matrix(i)(j) represents:
+   *         1, squared radii of the center i, if i==j. If distance between point x and center i
+   *         is less than the radius of center i, then center i is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here radian/angle
+   *         is used instead of Cosine distance: for center c, finding its closest center,
+   *         computing the radian/angle between them, halving it, and converting it back to Cosine
+   *         distance at the end.
+   *         2, a lower bound r=matrix(i)(j) to help avoiding unnecessary distance computation.
+   *         Given point x, let i be current closest center, and d be current best squared
+   *         distance, if d < r, then we no longer need to compute the distance to center j.
+   */
+  override def computeStatistics(distance: Double): Double = {
+    // d = 1 - cos(x)
+    // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+    1 - math.sqrt(1 - distance / 2)
+  }
+
+  /**
+   * @return the index of the closest center to the given point, as well as the cost.
+   */
+  def findClosest(
 
 Review comment:
   Is there any clean way to avoid duplicating most of this code? maybe not. It looks almost identical to the above though

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615101070
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593819477
 
 
   Merged build finished. Test FAILed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593816313
 
 
   I update the `radii` to the `statistics` including more than `radii`, so another bound can be used in both distance measurers: In `findClosest`, when cluster `i` is too far alway from current closest cluster `bestIndex` (for example, distance(cluster_i, cluster_bestIndex) > 2 * distance(cluster_bestIndex, x) in `EuclideanDistance`), then we can say that point x should not belong to cluster `i`, so no need to compute distance(cluster_i, x).
   
   Then I retest above testsuite, the prediction is a litter faster, but not significantly.
   
   I also test on **cosine-distance**, results are:
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|28.851|39.434|107.571|306.996|29.291|37.332|99.98|275.32|
   |NumIters|10|7|11|14|10|7|11|14|
   |Cost|3585.915362367295|2830.5043071043824|2410.540059493046|2057.831172250597|3585.9153623672946|2830.5043071043824|2410.540059493046|2057.8311722505964|
   |Prediction Duration (millsec)|29|29|29|33|32|29|32|35|
   
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|0.445|0.559|0.77|1.067|1.379|2.114|3.857|7.786|0.461|0.613|0.728|1.208|1.547|2.387|4.287|9.11|
   |NumIters|4|9|10|20|20|20|20|20|4|9|10|20|20|20|20|20|
   |Cost|9458.881512958757|8727.181294576074|7646.536181047704|6743.890831633205|6063.089195117649|5381.196166489522|4787.4797985497125|4275.705880212388|9458.881512958757|8727.181294576074|7646.536181047704|6743.890831633205|6063.089195117649|5381.196166489522|4787.4797985497125|4275.705880212388|
   |Prediction Duration (millsec)|29|30|28|28|28|28|29|28|32|30|34|30|31|31|30|36|
   
   
   We can see that KMeans impls with cosine distance have the (almost) same convergen.
   And the prediction is about 10% faster than master.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613407209
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25962/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593326104
 
 
   **[Test build #119166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119166/testReport)** for PR 27758 at commit [`c423b74`](https://github.com/apache/spark/commit/c423b74eaf72c40fcaa0cc823b600ae45299d245).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615144264
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/121401/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-613406732
 
 
   **[Test build #121275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121275/testReport)** for PR 27758 at commit [`b4fabb1`](https://github.com/apache/spark/commit/b4fabb1ddb9c2c1a92a0ad5c9ccbbaed4f3d0acc).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r409316055
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -17,23 +17,124 @@
 
 package org.apache.spark.mllib.clustering
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
+import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.mllib.linalg.BLAS.{axpy, dot, scal}
 import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   * @param distance distance between two centers
+   */
+  def computeStatistics(distance: Double): Double
+
+  /**
+   * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
+   *
+   * @return A symmetric matrix containing statistics, matrix(i)(j) represents:
 
 Review comment:
   good idea, it is symmetric. I will study how other impls like GMM to store only the upper triangular part of the matrix.
   Maybe it is helpful to support symmetric dense matrix in .linalg? since it is used in many places

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386450041
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -24,16 +24,60 @@ import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Radii of centers used in triangle inequality to obtain useful bounds to find
+   * closest centers.
+   *
+   * @see <a href="https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf">Charles Elkan,
+   *      Using the Triangle Inequality to Accelerate k-Means</a>
+   *
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   */
+  def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
 
 Review comment:
   Is there any value in this default impl? it's overridden in both subclasses. Or is it to support future impls that may not be able to use this optimization?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386456244
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -154,22 +198,86 @@ object DistanceMeasure {
 }
 
 private[spark] class EuclideanDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Euclidean distance, radius of center c is half of the distance between
+   *         center c and its closest center.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      distances.map(_ / 2)
+    }
+  }
+
+  /**
+   * @return the index of the closest center to the given point, as well as the cost.
+   */
+  override def findClosest(
+      centers: Array[VectorWithNorm],
+      radii: Array[Double],
+      point: VectorWithNorm): (Int, Double) = {
+    var bestDistance = Double.PositiveInfinity
+    var bestIndex = 0
+    var i = 0
+    var found = false
+    while (i < centers.length && !found) {
+      val center = centers(i)
+      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
+      // distance computation.
+      var lowerBoundOfSqDist = center.norm - point.norm
 
 Review comment:
   Nit: I'd just make these two vals for clarity

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593811584
 
 
   **[Test build #119208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119208/testReport)** for PR 27758 at commit [`dd2aff7`](https://github.com/apache/spark/commit/dd2aff729a8ff4ffea34691486337f0bc3b5f16b).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386452300
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -24,16 +24,60 @@ import org.apache.spark.mllib.util.MLUtils
 
 private[spark] abstract class DistanceMeasure extends Serializable {
 
+  /**
+   * Radii of centers used in triangle inequality to obtain useful bounds to find
+   * closest centers.
+   *
+   * @see <a href="https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf">Charles Elkan,
+   *      Using the Triangle Inequality to Accelerate k-Means</a>
+   *
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   */
+  def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    Array.fill(k)(Double.NaN)
+  }
+
   /**
    * @return the index of the closest center to the given point, as well as the cost.
    */
   def findClosest(
-      centers: TraversableOnce[VectorWithNorm],
+      centers: Array[VectorWithNorm],
+      radii: Array[Double],
 
 Review comment:
   Do you need the other implementation, if you have this one? this short-circuits the case where you find a point within a cluster radius. This is kept to support cosine distance?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386809482
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala
 ##########
 @@ -234,6 +342,39 @@ private[spark] object EuclideanDistanceMeasure {
 }
 
 private[spark] class CosineDistanceMeasure extends DistanceMeasure {
+
+  /**
+   * @return Radii of centers. If distance between point x and center c is less than
+   *         the radius of center c, then center c is the closest center to point x.
+   *         For Cosine distance, it is similar to Euclidean distance. However, here
+   *         radian/angle is used instead of Cosine distance: for center c, finding
+   *         its closest center, computing the radian/angle between them, halving the
+   *         radian/angle, and converting it back to Cosine distance at the end.
+   */
+  override def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
+    val k = centers.length
+    if (k == 1) {
+      Array(Double.NaN)
+    } else {
+      val distances = Array.fill(k)(Double.PositiveInfinity)
+      var i = 0
+      while (i < k) {
+        var j = i + 1
+        while (j < k) {
+          val d = distance(centers(i), centers(j))
+          if (d < distances(i)) distances(i) = d
+          if (d < distances(j)) distances(j) = d
+          j += 1
+        }
+        i += 1
+      }
+
+      // d = 1 - cos(x)
+      // r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
+      distances.map(d => 1 - math.sqrt(1 - d / 2))
 
 Review comment:
   Yes, Cosine distance doesn't obey the triangle inequality, but the following lemma should be available to apply:
   
   given a point x, and let b and c be centers. If angle(x, b)<angle(b,c)/2, then angle(x,b)<angle(x,c), cos_distance(x,b)=1-cos(x,b)<cos_distance(x,c)=1-cos(x,c)
   
   That is because: [PRINCIPLES FROM GEOMETRY, point 3](http://www.angelfire.com/nt/navtrig/B1.html)
   
   > Each side of a spherical triangle is less than the sum of the other two.
   
   [Triangle_inequality:](https://en.wikipedia.org/wiki/Triangle_inequality)
   
   > In spherical geometry, the shortest distance between two points is an arc of a great circle, but the triangle inequality holds provided the restriction is made that the distance between two points on a sphere is the length of a minor spherical line segment (that is, one with central angle in [0, π]) with those endpoints.[4][5]
   
   
   angle(x,b) + angle(x,c) > angle(b,c)
   angle(x,b) < angle(b,c)/2
   
   => angle(x,c) > angle(b,c)/2 > angle(x,b)
   => cos_distance(x,c) > cos_distance(x,b)
   
   
    angle(x,b) < angle(b,c)/2
   <=>  cos(x,b) > sqrt{ (cos(b,c) + 1)/2 }
   <=>  cos_distance(x,b) < 1 - sqrt{ (cos(b,c) + 1)/2 } = 1 - sqrt{ 1 - cos_distance(b,c) / 2  }
   
   => Give two centers b and c, if point x has cos_distance(x,b) < 1 - sqrt{ 1 - cos_distance(b,c) / 2  }, then point x belongs to center b.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#discussion_r386461703
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 ##########
 @@ -266,30 +266,35 @@ class KMeans private (
     var converged = false
     var cost = 0.0
     var iteration = 0
+    var totalRadiiTime = 0L
 
     val iterationStartTime = System.nanoTime()
 
     instr.foreach(_.logNumFeatures(centers.head.vector.size))
 
     // Execute iterations of Lloyd's algorithm until converged
     while (iteration < maxIterations && !converged) {
+      val radiiStartTime = System.nanoTime()
 
 Review comment:
   You might remove this after testing; it's not super important.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-615132707
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/121400/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550
 
 
   env: bin/spark-shell --driver-memory=64G
   
   testCode:
   ```scala
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
   df.persist()
   
   val km = new KMeans().setK(16).setMaxIter(5)
   val kmm = km.fit(df)
   
   val results = Seq(2,4,8,16).map { k => val km = new KMeans().setK(k).setMaxIter(20); val start = System.currentTimeMillis; val kmm = km.fit(df); val end = System.currentTimeMillis; val train = end - start;  kmm.transform(df).count; val start2 = System.currentTimeMillis; kmm.transform(df).count; val end2 = System.currentTimeMillis; val predict = end2 - start2; val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict); println(result); result }
   
   
   val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df2.persist()
   
   val km = new KMeans().setK(16).setMaxIter(10)
   val kmm = km.fit(df2)
   
   val results2 = Seq(2,4,8,16,32,64,128,256).map { k => val km = new KMeans().setK(k).setMaxIter(20); val start = System.currentTimeMillis; val kmm = km.fit(df2); val end = System.currentTimeMillis; val train = end - start;  kmm.transform(df2).count; val start2 = System.currentTimeMillis; kmm.transform(df).count; val end2 = System.currentTimeMillis; val predict = end2 - start2; val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict); println(result); result }
   ```
   
   1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|27.602|47.932|145.430|371.239|27.042|46.630|136.205|350.788|
   |Radii Computation (sec)|0.097|0.651|5.558|26.059|-|-|-|-|
   |NumIters|9|10|18|20|9|10|18|20|
   |Cost|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340055|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340064|
   |Prediction Duration (millsec)|31|31|30|30|36|33|41|43|
   
   
   2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | Master(k=238) | Master(k=3566)  |
   |------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration (sec)|0.465|1.411|0.957|1.109|1.387|2.137|3.891|8.373|0.484|0.758|1.065|1.3|1.577|2.413|4.483|9.616|
   |Radii Computation (sec)|0.000|0.000|0.000|0.001|0.002|0.007|0.024|0.093|-|-|-|-|-|-|-|-|
   |NumIters|5|14|20|20|20|20|20|20|5|14|20|20|20|20|20|20|
   |Cost|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|
   |Prediction Duration (millsec)|32|33|31|30|30|30|30|31|39|36|38|33|32|29|35|31|
   
   conclusion:
   1, the convergence of both impl are almost the same.
   2, new impl of prediction is likely to be faster than existing impl (after lazy computation of radii);
   3, computation of radii may take a few seconds, so training maybe slower than existing impl in some case (few instances, large k, large dim, then the computation of radii matters in the whole training), for example, webspam with k=16, new impl took 371.239 sec, while existing impl took 350.788 sec. That is because the computation of radii (in all 20 iterations) totally took 26.059 sec.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org