Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/10/30 11:26:14 UTC

[GitHub] spark pull request #22893: One part of Spark MLlib Kmean Logic Performance p...

GitHub user KyleLi1985 opened a pull request:

    https://github.com/apache/spark/pull/22893

    One part of Spark MLlib Kmean Logic Performance problem

    [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem
    
    ## What changes were proposed in this pull request?
    Reduce the low-performance logic in the function fastSquaredDistance.
    
    ## How was this patch tested?
    ./dev/run-tests passes.
    Called fastSquaredDistance 100000000 times on the following two test vectors:
    1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
    4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
    After the patch, the cost time drops from 8395 to 5448 milliseconds.
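    
    A minimal sketch of that timing loop (assuming a local copy of fastSquaredDistance, since the MLUtils method is private[mllib]; the vectors are the two listed above):
    
        import org.apache.spark.mllib.linalg.Vectors
        
        // Timing-loop sketch for the numbers above. `fastSquaredDistance` is assumed to be
        // a local copy of the MLUtils implementation (the real one is private[mllib]).
        val v1 = Vectors.dense(Array(1,2,3,4,3,4,5,6,7,8,9,0,1,3,4,6,7,4,2,2,5,7,8,9,3,2,3,5,7,8,9,3,3,2,1,1,2,2,9,3,3,4,5).map(_.toDouble))
        val v2 = Vectors.dense(Array(4,5,2,1,5,6,3,2,1,3,4,6,7,8,9,0,3,2,1,2,3,4,5,6,7,8,5,3,2,1,4,5,6,7,8,4,3,2,4,6,7,8,9).map(_.toDouble))
        val (n1, n2) = (Vectors.norm(v1, 2.0), Vectors.norm(v2, 2.0))
        
        var d = 0.0
        val start = System.currentTimeMillis()
        for (_ <- 0 until 100000000) {
          d = fastSquaredDistance(v1, n1, v2, n2)
        }
        println(s"cost time: ${System.currentTimeMillis() - start} ms, sqDist = $d")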


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KyleLi1985/spark updatekmeanpatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22893.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22893
    
----
commit 701223b39a0c473de865de30b0017af4883441f3
Author: 李亮 <li...@...>
Date:   2018-10-30T11:03:02Z

    upgrade kmean performance

----


---



[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    There's no merge conflict right now. You can just update the file and push the commit to your branch. If there were a merge conflict, you'd just rebase on apache/master, resolve the conflict, and force-push the branch.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Heh, as a side effect, this made the output of computeCost more accurate in one PySpark test. It prints "2.0" rather than "2.000...". I think you can change the three instances that failed to just expect "2.0".


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    I put together the final test case for the sparse case and the dense case on realistic data, to test the new commit:
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2561442/SparkMLlibTest.txt)
    
    **For the dense case**, we use data from
    http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
    and use all the dense data files from this real dataset.
    
    **Dense case test result, time cost (milliseconds)**
    Before enhancement: 211878 210845 215375
    After enhancement: 140827 149282 130691
    
    **For the sparse case**, we use data from
    http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
    and extract all the sparse data files (PS1, PS2, PS3, PS4, PS5, PS6) from this real dataset.
    
    **Sparse case test result, time cost (milliseconds)**
    Before enhancement: 108080 103582 103586
    After enhancement: 107652 107145 104768


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    retest this please


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    **[Test build #4420 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4420/testReport)** for PR 22893 at commit [`762755d`](https://github.com/apache/spark/commit/762755d7b4196c808323bdcd6ee5db9127610bd1).


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
    
    Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases:
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2541007/SparkMLlibTest.txt)
    
    `
    
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
      import com.github.fommil.netlib.F2jBLAS
      import java.util.Date
    
    
      object SparkMLlibTest {
        class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
    
          def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
    
          def this(array: Array[Double]) = this(Vectors.dense(array))
    
          /** Converts the vector to a dense vector. */
          def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
        }
    
        def dot(x: Vector, y: Vector): Double = {
          require(x.size == y.size,
            "BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
              " x.size = " + x.size + ", y.size = " + y.size)
          (x, y) match {
            case (dx: DenseVector, dy: DenseVector) =>
              dot(dx, dy)
            case (sx: SparseVector, dy: DenseVector) =>
              dot(sx, dy)
            case (dx: DenseVector, sy: SparseVector) =>
              dot(sy, dx)
            case (sx: SparseVector, sy: SparseVector) =>
              dot(sx, sy)
            case _ =>
              throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
          }
        }
    
        def dot(x: DenseVector, y: DenseVector): Double = {
          val n = x.size
          new F2jBLAS().ddot(n, x.values, 1, y.values, 1)
        }
    
        def dot(x: SparseVector, y: DenseVector): Double = {
          val xValues = x.values
          val xIndices = x.indices
          val yValues = y.values
          val nnz = xIndices.length
    
          var sum = 0.0
          var k = 0
          while (k < nnz) {
            sum += xValues(k) * yValues(xIndices(k))
            k += 1
          }
          sum
        }
    
        /**
          * dot(x, y)
          */
        def dot(x: SparseVector, y: SparseVector): Double = {
          val xValues = x.values
          val xIndices = x.indices
          val yValues = y.values
          val yIndices = y.indices
          val nnzx = xIndices.length
          val nnzy = yIndices.length
    
          var kx = 0
          var ky = 0
          var sum = 0.0
          // y catching x
          while (kx < nnzx && ky < nnzy) {
            val ix = xIndices(kx)
            while (ky < nnzy && yIndices(ky) < ix) {
              ky += 1
            }
            if (ky < nnzy && yIndices(ky) == ix) {
              sum += xValues(kx) * yValues(ky)
              ky += 1
            }
            kx += 1
          }
          sum
        }
    
        lazy val EPSILON = {
          var eps = 1.0
          while ((1.0 + (eps / 2.0)) != 1.0) {
            eps /= 2.0
          }
          eps
        }
    
        def fastSquaredDistanceAddedPatch(
                                                v1: Vector,
                                                norm1: Double,
                                                v2: Vector,
                                                norm2: Double,
                                                precision: Double = 1e-6): Double = {
          val n = v1.size
          require(v2.size == n)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
          val normDiff = norm1 - norm2
          var sqDist = 0.0
          /*
           * The relative error is
           * <pre>
           * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
           * </pre>
           * which is bounded by
           * <pre>
           * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
           * </pre>
           * The bound doesn't need the inner product, so we can use it as a sufficient condition to
           * check quickly whether the inner product approach is accurate.
           */
          val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    
          if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
            || !v2.isInstanceOf[DenseVector])) {
            sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
          } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
            val dotValue = dot(v1, v2)
            sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
            val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
              (sqDist + EPSILON)
            if (precisionBound2 > precision) {
              sqDist = Vectors.sqdist(v1, v2)
            }
          } else {
            sqDist = Vectors.sqdist(v1, v2)
          }
    
          sqDist
        }
    
        def fastSquaredDistanceOriginal(
                                           v1: Vector,
                                           norm1: Double,
                                           v2: Vector,
                                           norm2: Double,
                                           precision: Double = 1e-6): Double = {
          val n = v1.size
          require(v2.size == n)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
          val normDiff = norm1 - norm2
          var sqDist = 0.0
          /*
           * The relative error is
           * <pre>
           * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
           * </pre>
           * which is bounded by
           * <pre>
           * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
           * </pre>
           * The bound doesn't need the inner product, so we can use it as a sufficient condition to
           * check quickly whether the inner product approach is accurate.
           */
          val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    
          if (precisionBound1 < precision) {
            sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
          } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
            val dotValue = dot(v1, v2)
            sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
            val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
              (sqDist + EPSILON)
            if (precisionBound2 > precision) {
              sqDist = Vectors.sqdist(v1, v2)
            }
          } else {
            sqDist = Vectors.sqdist(v1, v2)
          }
    
          sqDist
        }
    
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("Test")
          conf.setMaster("local[*]")
          conf.set("spark.testing.memory", "512000000")
          val sc = new SparkContext(conf)
    
          val testData1 = sc.parallelize(List(Array(0,0,2,3,4,1,3,5,3,2,1,5,0,5,0,0,3,0,1,2,3,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,2)),1)
          val testData2 = sc.parallelize(List(Array(1,2,3,0,2,3,0,3,5,7,2,1,3,2,0,3,2,1,0,4,0,1,0,1,0,1,0,0,0,0,0,2,0,0,0,1,0)),1)
    
          val vector1dense = testData1.map{
            f =>
              val vector = Vectors.dense(f.map(f=>f.toDouble))
              new VectorWithNorm(vector, Vectors.norm(vector,2))
          }.take(1).apply(0)
          val vector2dense = testData2.map{
            f =>
              val vector = Vectors.dense(f.map(f=>f.toDouble))
              new VectorWithNorm(vector, Vectors.norm(vector,2))
          }.take(1).apply(0)
          val vector1sparse = testData1.map{
            f =>
              val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
              val spasevector = vector.toSparse
              new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
          }.take(1).apply(0)
          val vector2sparse = testData2.map{
            f =>
              val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
              val spasevector = vector.toSparse
              new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
          }.take(1).apply(0)
    
    
          var vector1 = Vectors.zeros(0)
          var vector2 = Vectors.zeros(0)
          var norm1 = 0.0
          var norm2 = 0.0
    
    
          //use different input data mode to check before and after add patch
          //INPUT mode:
          //sparse/sparseCase
          //dense/denseCase
          //dense/sparseCase
    
          val key = "dense/sparseCase"
          key match {
            case "sparse/sparseCase" =>
              vector1 = vector1sparse.vector
              vector2 = vector2sparse.vector
              norm1 = vector1sparse.norm
              norm2 = vector2sparse.norm
            case "dense/denseCase" =>
              vector1 = vector1dense.vector
              vector2 = vector2dense.vector
              norm1 = vector1dense.norm
              norm2 = vector2dense.norm
            case "dense/sparseCase" =>
              vector1 = vector1sparse.vector
              vector2 = vector2dense.vector
              norm1 = vector1sparse.norm
              norm2 = vector2dense.norm
          }
    
          var sqOri = 0.0
          val startTime1 = new Date().getTime
          for (i <- 0 until(100000000)) {
            // benchmark the original implementation first
            sqOri = fastSquaredDistanceOriginal(vector1, norm1, vector2, norm2)
          }
          val endTime1 = new Date().getTime
    
          println("sqOri: " + sqOri)
          println("costTime1: " + (endTime1 - startTime1))
          
          var sqAP = 0.0
          val startTime2 = new Date().getTime
          for (a <- 0 until(100000000)) {
            sqAP = fastSquaredDistanceAddedPatch(vector1,norm1,vector2,norm2)
          }
          val endTime2 = new Date().getTime
          println("sqAp: " + sqAP)
          println("costTime2: " + (endTime2 - startTime2))
    
    
          sc.stop()
        }
      }
    
    `


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.
    
    @srowen @HyukjinKwon @mgaido91 Thanks for the review. It is my pleasure.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    **[Test build #4424 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4424/testReport)** for PR 22893 at commit [`50ef296`](https://github.com/apache/spark/commit/50ef296e4df865105188f00e09be758187d0fbe9).


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
    > 
    > Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
    > 
    > And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
    > 
    > (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
    
    We use only dense vectors; here is the test file:
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2538356/SparkMLlibTest.txt)
    I extracted the relevant part from the code and compared the performance; the result shows that for dense vectors sqdist is faster.
    And for the end-to-end test, I consider the worst situation: the input vectors are all dense and the precision bound is not met.
    
    `
    
        if (precisionBound1 < precision && !v1.isInstanceOf[DenseVector]
          && !v2.isInstanceOf[DenseVector]) {
            sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
        } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
          val dotValue = dot(v1, v2)
          sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
          val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
            (sqDist + EPSILON)
          if (precisionBound2 > precision) {
            sqDist = Vectors.sqdist(v1, v2)
          }
        } else {
          sqDist = Vectors.sqdist(v1, v2)
        }
    `
    
    With the logic above, only sqdist is used to calculate the distance in the dense-dense case.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    **[Test build #4424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4424/testReport)** for PR 22893 at commit [`50ef296`](https://github.com/apache/spark/commit/50ef296e4df865105188f00e09be758187d0fbe9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Also, please fill in the PR description. How much performance does it gain?


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.
    
    Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.
    
    How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?
    
    This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
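    
    For instance, a generator along these lines (just a rough sketch; the dimension, sparsity, and count are arbitrary illustration values) would produce that kind of input:
    
        import scala.util.Random
        import org.apache.spark.mllib.linalg.{Vector, Vectors}
        
        // Sketch of a more representative input generator: larger vectors, genuinely
        // sparse sparse vectors, and many distinct inputs rather than one repeated pair.
        def randomDense(dim: Int, rng: Random): Vector =
          Vectors.dense(Array.fill(dim)(rng.nextGaussian()))
        
        def randomSparse(dim: Int, density: Double, rng: Random): Vector = {
          val idx = (0 until dim).filter(_ => rng.nextDouble() < density).toArray
          Vectors.sparse(dim, idx, idx.map(_ => rng.nextGaussian()))
        }
        
        val rng = new Random(42)
        val denseInputs  = Array.fill(10000)(randomDense(500, rng))
        val sparseInputs = Array.fill(10000)(randomSparse(500, 0.05, rng))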


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    End-to-end test:
    I used the code below to run the test.
    `  
    test("kmeanproblem") {
        val rdd = sc
            .textFile("/Users/liliang/Desktop/inputdata.txt")
            .map(f => f.split(",")
            .map(f => f.toDouble))
    
        val vectorRdd = rdd.map(f => Vectors.dense(f))
        val startTime = System.currentTimeMillis()
        for (i <- 0 until 20) {
          val model = new KMeans()
            .setK(8)
            .setMaxIterations(100)
            .setInitializationMode(K_MEANS_PARALLEL)
            .run(vectorRdd)
        }
        val endTime = System.currentTimeMillis()
    
        // scalastyle:off println
        println("cost time: " + (endTime - startTime))
        // scalastyle:on println
    }
    `
    Input data:
    I extracted 57216 items from the dataset (http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals) to form the test input data.
    Test result:
    Before: cost time is 297686 milliseconds (considering the worst situation)
    After the patch: cost time is 180544 milliseconds (considering the worst situation)
    
    Function-level test:
    Only the function fastSquaredDistance is tested, as follows:
    call fastSquaredDistance 100000000 times, before and after the patch respectively
    
    Input Data:
    1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
    4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
    Test Result:
    Before: cost time is 8395 milliseconds
    After the patch: cost time is 5448 milliseconds
    
    So according to the tests above, we can conclude that the patch gives better performance for the function fastSquaredDistance in Spark k-means.
    (Furthermore, sqDist = Vectors.sqdist(v1, v2)
        performs better than
        sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
        in calculation speed.
    )
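    
    A minimal sketch of that comparison (the vector values below are arbitrary placeholders, and F2jBLAS is used directly, as in the attached test file):
    
        import com.github.fommil.netlib.F2jBLAS
        import org.apache.spark.mllib.linalg.Vectors
        
        // Time Vectors.sqdist against the norm/dot formula on a single dense pair.
        val x = Vectors.dense(Array.tabulate(43)(i => (i % 9).toDouble))
        val y = Vectors.dense(Array.tabulate(43)(i => ((i * 3) % 7).toDouble))
        val (nx, ny) = (Vectors.norm(x, 2.0), Vectors.norm(y, 2.0))
        val (xa, ya) = (x.toArray, y.toArray)
        val blas = new F2jBLAS()
        
        var s = 0.0
        var t = System.currentTimeMillis()
        for (_ <- 0 until 100000000) s = Vectors.sqdist(x, y)
        println(s"sqdist:   ${System.currentTimeMillis() - t} ms")
        
        t = System.currentTimeMillis()
        for (_ <- 0 until 100000000) s = nx * nx + ny * ny - 2.0 * blas.ddot(xa.length, xa, 1, ya, 1)
        println(s"norm-dot: ${System.currentTimeMillis() - t} ms")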


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    It seems the related file spark/python/pyspark/ml/clustering.py has changed in the meantime; my local latest commit is still "bfe60fc on 30 Jul". Do I need to re-fork Spark and open another pull request, or is there another way?


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases:
    `
    
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
      import com.github.fommil.netlib.F2jBLAS
      import java.util.Date
      
      
      object SparkMLlibTest {
        class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
      
          def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
      
          def this(array: Array[Double]) = this(Vectors.dense(array))
      
          /** Converts the vector to a dense vector. */
          def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
        }
      
        def dot(x: Vector, y: Vector): Double = {
          require(x.size == y.size,
            "BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
              " x.size = " + x.size + ", y.size = " + y.size)
          (x, y) match {
            case (dx: DenseVector, dy: DenseVector) =>
              dot(dx, dy)
            case (sx: SparseVector, dy: DenseVector) =>
              dot(sx, dy)
            case (dx: DenseVector, sy: SparseVector) =>
              dot(sy, dx)
            case (sx: SparseVector, sy: SparseVector) =>
              dot(sx, sy)
            case _ =>
              throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
          }
        }
      
        def dot(x: DenseVector, y: DenseVector): Double = {
          val n = x.size
          new F2jBLAS().ddot(n, x.values, 1, y.values, 1)
        }
      
        def dot(x: SparseVector, y: DenseVector): Double = {
          val xValues = x.values
          val xIndices = x.indices
          val yValues = y.values
          val nnz = xIndices.length
      
          var sum = 0.0
          var k = 0
          while (k < nnz) {
            sum += xValues(k) * yValues(xIndices(k))
            k += 1
          }
          sum
        }
      
        /**
          * dot(x, y)
          */
        def dot(x: SparseVector, y: SparseVector): Double = {
          val xValues = x.values
          val xIndices = x.indices
          val yValues = y.values
          val yIndices = y.indices
          val nnzx = xIndices.length
          val nnzy = yIndices.length
      
          var kx = 0
          var ky = 0
          var sum = 0.0
          // y catching x
          while (kx < nnzx && ky < nnzy) {
            val ix = xIndices(kx)
            while (ky < nnzy && yIndices(ky) < ix) {
              ky += 1
            }
            if (ky < nnzy && yIndices(ky) == ix) {
              sum += xValues(kx) * yValues(ky)
              ky += 1
            }
            kx += 1
          }
          sum
        }
      
        lazy val EPSILON = {
          var eps = 1.0
          while ((1.0 + (eps / 2.0)) != 1.0) {
            eps /= 2.0
          }
          eps
        }
      
        def fastSquaredDistanceAddedPatch(
                                                v1: Vector,
                                                norm1: Double,
                                                v2: Vector,
                                                norm2: Double,
                                                precision: Double = 1e-6): Double = {
          val n = v1.size
          require(v2.size == n)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
          val normDiff = norm1 - norm2
          var sqDist = 0.0
          /*
           * The relative error is
           * <pre>
           * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
           * </pre>
           * which is bounded by
           * <pre>
           * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
           * </pre>
           * The bound doesn't need the inner product, so we can use it as a sufficient condition to
           * check quickly whether the inner product approach is accurate.
           */
          val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
      
          if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
            || !v2.isInstanceOf[DenseVector])) {
            sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
          } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
            val dotValue = dot(v1, v2)
            sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
            val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
              (sqDist + EPSILON)
            if (precisionBound2 > precision) {
              sqDist = Vectors.sqdist(v1, v2)
            }
          } else {
            sqDist = Vectors.sqdist(v1, v2)
          }
      
          sqDist
        }
      
        def fastSquaredDistanceOriginal(
                                           v1: Vector,
                                           norm1: Double,
                                           v2: Vector,
                                           norm2: Double,
                                           precision: Double = 1e-6): Double = {
          val n = v1.size
          require(v2.size == n)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
          val normDiff = norm1 - norm2
          var sqDist = 0.0
          /*
           * The relative error is
           * <pre>
           * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
           * </pre>
           * which is bounded by
           * <pre>
           * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
           * </pre>
           * The bound doesn't need the inner product, so we can use it as a sufficient condition to
           * check quickly whether the inner product approach is accurate.
           */
          val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
      
          if (precisionBound1 < precision) {
            sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
          } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
            val dotValue = dot(v1, v2)
            sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
            val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
              (sqDist + EPSILON)
            if (precisionBound2 > precision) {
              sqDist = Vectors.sqdist(v1, v2)
            }
          } else {
            sqDist = Vectors.sqdist(v1, v2)
          }
      
          sqDist
        }
      
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("Test")
          conf.setMaster("local[*]")
          conf.set("spark.testing.memory", "512000000")
          val sc = new SparkContext(conf)
      
          val testData1 = sc.parallelize(List(Array(0,0,2,3,4,1,3,5,3,2,1,5,0,5,0,0,3,0,1,2,3,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,2)),1)
          val testData2 = sc.parallelize(List(Array(1,2,3,0,2,3,0,3,5,7,2,1,3,2,0,3,2,1,0,4,0,1,0,1,0,1,0,0,0,0,0,2,0,0,0,1,0)),1)
      
          val vector1dense = testData1.map{
            f =>
              val vector = Vectors.dense(f.map(f=>f.toDouble))
              new VectorWithNorm(vector, Vectors.norm(vector,2))
          }.take(1).apply(0)
          val vector2dense = testData2.map{
            f =>
              val vector = Vectors.dense(f.map(f=>f.toDouble))
              new VectorWithNorm(vector, Vectors.norm(vector,2))
          }.take(1).apply(0)
          val vector1sparse = testData1.map{
            f =>
              val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
              val spasevector = vector.toSparse
              new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
          }.take(1).apply(0)
          val vector2sparse = testData2.map{
            f =>
              val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
              val spasevector = vector.toSparse
              new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
          }.take(1).apply(0)
      
      
          var vector1 = Vectors.zeros(0)
          var vector2 = Vectors.zeros(0)
          var norm1 = 0.0
          var norm2 = 0.0
      
      
          //use different input data mode to check before and after add patch
          //INPUT mode:
          //sparse/sparseCase
          //dense/denseCase
          //dense/sparseCase
      
          val key = "dense/sparseCase"
          key match {
            case "sparse/sparseCase" =>
              vector1 = vector1sparse.vector
              vector2 = vector2sparse.vector
              norm1 = vector1sparse.norm
              norm2 = vector2sparse.norm
            case "dense/denseCase" =>
              vector1 = vector1dense.vector
              vector2 = vector2dense.vector
              norm1 = vector1dense.norm
              norm2 = vector2dense.norm
            case "dense/sparseCase" =>
              vector1 = vector1sparse.vector
              vector2 = vector2dense.vector
              norm1 = vector1sparse.norm
              norm2 = vector2dense.norm
          }
      
          var sqOri = 0.0
          val startTime1 = new Date().getTime
          for (i <- 0 until(100000000)) {
            // benchmark the original implementation first
            sqOri = fastSquaredDistanceOriginal(vector1, norm1, vector2, norm2)
          }
          val endTime1 = new Date().getTime
      
          println("sqOri: " + sqOri)
          println("costTime1: " + (endTime1 - startTime1))
          
          var sqAP = 0.0
          val startTime2 = new Date().getTime
          for (a <- 0 until(100000000)) {
            sqAP = fastSquaredDistanceAddedPatch(vector1,norm1,vector2,norm2)
          }
          val endTime2 = new Date().getTime
          println("sqAp: " + sqAP)
          println("costTime2: " + (endTime2 - startTime2))
      
      
          sc.stop()
        }
      }
    `
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2541002/SparkMLlibTest.txt)



---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM). 
    
    Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note. 
    
    And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
    
    (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.
    
    OK, for a fair result, I will try it.



---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.
    > 
    > Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.
    > 
    > How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?
    > 
    > This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
    
    1) Why use Spark? No special reason; it is simply the tool I usually work with.
    
    2) About the vectors: I ran a test with more representative input; the result is shown below.
    
    3) About the precision: it is a trick; you can steer the calculation into whichever branch you want by changing it manually. As I said in the last comment, take LOGIC2 for example: you can manually change the precision to -10000 in (precisionBound1 < precision) and to 10000 in (precisionBound2 > precision), so the calculation goes into the LOGIC2 branch. It is like a code-coverage exercise. Anyway, the goal is to show that, within the same calculation branch, performance does not change before and after the enhancement for the sparse-sparse and sparse-dense cases.
    
    Here is my test file:
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2544667/SparkMLlibTest.txt)
    
    Here is my test data setup:
    I use the dataset
    http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
    and extract the files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data.
    
    There are 13230 instances in total,
    with 6000 attributes per line.
    
    **Result for the sparse-sparse case, time cost (milliseconds)**
    Before enhancement: 7670, 7704, 7652
    After enhancement: 7634, 7729, 7645



---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    @KyleLi1985 do you have native BLAS installed?


---



[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    How much performance does it gain in end-to-end test, and how does it provide better performance?


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    **[Test build #4423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4423/testReport)** for PR 22893 at commit [`dc30bac`](https://github.com/apache/spark/commit/dc30bace482aa129178b5b34f371ff20c8b545ba).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.
    > 
    > This doesn't show an improvement, right?
    
    About the test, I agree with you.
    
    There is no influence for the sparse case.


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.
    
    This part only uses F2j for the calculation, as I said in the last comment, so the performance is not influenced by native BLAS.



---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    @SparkQA test this please


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.
    
    This doesn't show an improvement, right?


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    @AmplabJenkins test this please


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
    
    Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases:
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2541385/SparkMLlibTest.txt)
    
    Here are the test results.
    
    First we define some branch-path labels for the sparse-sparse and sparse-dense cases:
    if precisionBound1 is met, we call it LOGIC1;
    if precisionBound1 is not met and precisionBound2 is not met, we call it LOGIC2;
    if precisionBound1 is not met but precisionBound2 is met, we call it LOGIC3.
    (There is a trick: you can manually change the precision value to hit each of these situations, as sketched below.)
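    
    For example (a sketch, assuming the fastSquaredDistanceOriginal helper and the v1 / norm1 / v2 / norm2 values from the attached test file are in scope):
    
        // LOGIC1: any finite precisionBound1 is below a huge precision, so the dot shortcut runs.
        val logic1 = fastSquaredDistanceOriginal(v1, norm1, v2, norm2, precision = Double.MaxValue)
        // LOGIC3: a negative precision fails precisionBound1 and is always exceeded by
        // precisionBound2, so the sqdist fallback runs.
        val logic3 = fastSquaredDistanceOriginal(v1, norm1, v2, norm2, precision = -10000.0)
        // LOGIC2 needs a precision strictly between the two bounds, which depends on the data;
        // in the test this is done by hard-coding two different constants in the function body.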
    
    **sparse-sparse case time cost (milliseconds)**
    LOGIC1
    Before add patch: 7786, 7970, 8086
    After add patch: 7729, 7653, 7903
    LOGIC2
    Before add patch: 8412, 9029, 8606
    After add patch: 8603, 8724, 9024
    LOGIC3
    Before add patch: 19365, 19146, 19351
    After add patch: 18917, 19007, 19074
    
    **sparse-dense case time cost situation (milliseconds)**
    LOGIC1
    Before add patch: 4195, 4014, 4409
    After add patch: 4081, 3971, 4151
    LOGIC2
    Before add patch: 4968, 5579, 5080
    After add patch: 4980, 5472, 5148
    LOGIC3
    Before add patch: 11848, 12077, 12168
    After add patch: 11718, 11874, 11743
    
    And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.
    
    **dense-dense case time cost situation (milliseconds)**
    Before add patch: 7340, 7816, 7672
    After add patch: 5752, 5800, 5753
    
    The above results are based on a comparison between the original fastSquaredDistance and the enhanced fastSquaredDistance, which is shown below:
    
    `
    
        private[mllib] def fastSquaredDistance(
            v1: Vector,
            norm1: Double,
            v2: Vector,
            norm2: Double,
            precision: Double = 1e-6): Double = {
          val n = v1.size
          require(v2.size == n)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
          val normDiff = norm1 - norm2
          var sqDist = 0.0
          /*
           * The relative error is
           * <pre>
           * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
           * </pre>
           * which is bounded by
           * <pre>
           * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
           * </pre>
           * The bound doesn't need the inner product, so we can use it as a sufficient condition to
           * check quickly whether the inner product approach is accurate.
           */
        
          if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
            sqDist = Vectors.sqdist(v1, v2)
          } else {
            val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
      
            if (precisionBound1 < precision) {
              sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
            } else {
              val dotValue = dot(v1, v2)
              sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
              val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
                (sqDist + EPSILON)
      
              if (precisionBound2 > precision) {
                sqDist = Vectors.sqdist(v1, v2)
              }
            }
          }
      
          sqDist
        }
    `
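    
    As a quick cross-check (a sketch, assuming the fastSquaredDistanceOriginal and fastSquaredDistanceAddedPatch helpers from the attached test file are in scope), the two versions should agree up to the ~1e-6 relative precision they both target:
    
        import scala.util.Random
        import org.apache.spark.mllib.linalg.Vectors
        
        // Numerical cross-check between the original and patched versions on a random dense pair.
        val rng = new Random(0)
        val a = Vectors.dense(Array.fill(100)(rng.nextDouble()))
        val b = Vectors.dense(Array.fill(100)(rng.nextDouble()))
        val (na, nb) = (Vectors.norm(a, 2.0), Vectors.norm(b, 2.0))
        
        val before = fastSquaredDistanceOriginal(a, na, b, nb)
        val after = fastSquaredDistanceAddedPatch(a, na, b, nb)
        // Both are accurate to roughly 1e-6 relative error, so they should agree to that order.
        assert(math.abs(before - after) <= 1e-5 * math.max(1.0, math.max(before, after)))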


---



[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
    > 
    > Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
    > 
    > And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
    > 
    > (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
    
    We use only dense vectors; here is the test file:
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2538030/SparkMLlibTest.txt)
    I extracted the relevant part from the code and compared the performance; the result shows that for dense vectors sqdist is faster.
    And for the end-to-end test, I consider the worst situation: the input vectors are all dense and the precision bound is not met.
    
    And the other situations are indeed as you say: if the precision bound is met, using sqdist is not a good way.
    So this part of the logic should be:
    `
        if (precisionBound1 < precision && v1.isInstanceOf[SparseVector] && v2.isInstanceOf[SparseVector]) 
       {
           sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
        } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
          val dotValue = dot(v1, v2)
          sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
          val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
            (sqDist + EPSILON)
          if (precisionBound2 > precision) {
            sqDist = Vectors.sqdist(v1, v2)
          }
        } else {
          sqDist = Vectors.sqdist(v1, v2)
        }
    `
    With the logic above, sqdist is only used to calculate the distance in the dense-dense and fallback cases.
    



---



[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Please fix the PR title as described in https://spark.apache.org/contributing.html and read it.


---



[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22893#discussion_r231555559
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -521,19 +521,21 @@ object MLUtils extends Logging {
          * The bound doesn't need the inner product, so we can use it as a sufficient condition to
          * check quickly whether the inner product approach is accurate.
          */
    -    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    -    if (precisionBound1 < precision) {
    -      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    -    } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
    -      val dotValue = dot(v1, v2)
    -      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
    -      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
    -        (sqDist + EPSILON)
    -      if (precisionBound2 > precision) {
    -        sqDist = Vectors.sqdist(v1, v2)
    -      }
    -    } else {
    +    if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
           sqDist = Vectors.sqdist(v1, v2)
    +    } else {
    +      val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    --- End diff --
    
    Shouldn't you pull computation of things like normDiff into this block, then? That would improve the dense case more, I suppose.
    
    I'd still like to see your current final test code to understand what it's testing. If it's a win on some types of vectors, on realistic data, and doesn't hurt performance otherwise, this could be OK.
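    
    Something along these lines (just a sketch based on the patched version quoted elsewhere in this thread, not the committed diff) would keep all of the norm bookkeeping out of the dense-dense path:
    
        private[mllib] def fastSquaredDistance(
            v1: Vector,
            norm1: Double,
            v2: Vector,
            norm2: Double,
            precision: Double = 1e-6): Double = {
          require(v2.size == v1.size)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
            // Dense-dense: no norms or precision bounds needed at all.
            Vectors.sqdist(v1, v2)
          } else {
            // Only the sparse/mixed branches need the norm-based bookkeeping.
            val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
            val normDiff = norm1 - norm2
            val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
            if (precisionBound1 < precision) {
              sumSquaredNorm - 2.0 * dot(v1, v2)
            } else {
              val dotValue = dot(v1, v2)
              val sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
              val precisionBound2 =
                EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + EPSILON)
              if (precisionBound2 > precision) Vectors.sqdist(v1, v2) else sqDist
            }
          }
        }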


---



[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22893#discussion_r232456101
  
    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -88,6 +88,14 @@ def clusterSizes(self):
             """
             return self._call_java("clusterSizes")
     
    +    @property
    --- End diff --
    
    Was this added by mistake? This change should just be the one above plus the test change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    @SparkQA retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22893


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
    
    Here are the test results:
    
    First we need to define the branch paths for the sparse-sparse and sparse-dense cases:
    if precisionBound1 is met, we call that path LOGIC1
    if precisionBound1 is not met and precisionBound2 is not met, we call it LOGIC2
    if precisionBound1 is not met but precisionBound2 is met, we call it LOGIC3
    (As a trick, you can manually change the precision value to force each of these situations.)
    
    **sparse-sparse case time cost (milliseconds)**
    
    LOGIC1
    Before the patch: 7786, 7970, 8086
    After the patch: 7729, 7653, 7903
    LOGIC2
    Before the patch: 8412, 9029, 8606
    After the patch: 8603, 8724, 9024
    LOGIC3
    Before the patch: 19365, 19146, 19351
    After the patch: 18917, 19007, 19074
    
    **sparse-dense case time cost (milliseconds)**
    
    LOGIC1
    Before the patch: 4195, 4014, 4409
    After the patch: 4081, 3971, 4151
    LOGIC2
    Before the patch: 4968, 5579, 5080
    After the patch: 4980, 5472, 5148
    LOGIC3
    Before the patch: 11848, 12077, 12168
    After the patch: 11718, 11874, 11743
    
    And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.
    
    **dense-dense case time cost (milliseconds)**
    
    Before the patch: 7340, 7816, 7672
    After the patch: 5752, 5800, 5753
    
    The above results are based on the fastSquaredDistance shown below:
    
    `
    
        private[mllib] def fastSquaredDistance(
            v1: Vector,
            norm1: Double,
            v2: Vector,
            norm2: Double,
            precision: Double = 1e-6): Double = {
          val n = v1.size
          require(v2.size == n)
          require(norm1 >= 0.0 && norm2 >= 0.0)
          val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
          val normDiff = norm1 - norm2
          var sqDist = 0.0
          /*
           * The relative error is
           * <pre>
           * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
           * </pre>
           * which is bounded by
           * <pre>
           * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
           * </pre>
           * The bound doesn't need the inner product, so we can use it as a sufficient condition to
           * check quickly whether the inner product approach is accurate.
           */
          val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
      
          if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
            || !v2.isInstanceOf[DenseVector])) {
            sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
          } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
            val dotValue = dot(v1, v2)
            sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
            val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
              (sqDist + EPSILON)
            if (precisionBound2 > precision) {
              sqDist = Vectors.sqdist(v1, v2)
            }
          } else {
            sqDist = Vectors.sqdist(v1, v2)
          }
      
          sqDist
        }
    `
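    
    For reference, here is a minimal sketch of the kind of timing harness behind these numbers (the object name, the vector contents, the iteration count, and the precision override used to force the different LOGIC paths are assumptions for illustration, not the exact test code):
    
    `
        // Assumes compilation inside org.apache.spark.mllib.util so that the
        // private[mllib] fastSquaredDistance above is visible.
        package org.apache.spark.mllib.util

        import org.apache.spark.mllib.linalg.{Vector, Vectors}

        object FastSquaredDistanceBench {
          def time(label: String, v1: Vector, v2: Vector,
              precision: Double, iters: Int): Unit = {
            val norm1 = Vectors.norm(v1, 2.0)
            val norm2 = Vectors.norm(v2, 2.0)
            var acc = 0.0
            val start = System.currentTimeMillis()
            var i = 0
            while (i < iters) {
              acc += MLUtils.fastSquaredDistance(v1, norm1, v2, norm2, precision)
              i += 1
            }
            println(s"$label: ${System.currentTimeMillis() - start} ms (checksum $acc)")
          }

          def main(args: Array[String]): Unit = {
            val d1 = Vectors.dense(Array.fill(50)(math.random))
            val d2 = Vectors.dense(Array.fill(50)(math.random))
            val s1 = Vectors.sparse(50, Array(0, 7, 23), Array(1.0, 4.0, 2.0))
            val s2 = Vectors.sparse(50, Array(3, 7, 41), Array(5.0, 1.0, 3.0))
            // Adjusting precision (for example a very large vs. a very small value)
            // is one way to force the different LOGIC paths described above.
            time("dense-dense", d1, d2, 1e-6, 100000000)
            time("sparse-sparse", s1, s2, 1e-6, 100000000)
            time("sparse-dense", s1, d2, 1e-6, 100000000)
          }
        }
    `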


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    **[Test build #4423 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4423/testReport)** for PR 22893 at commit [`dc30bac`](https://github.com/apache/spark/commit/dc30bace482aa129178b5b34f371ff20c8b545ba).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    **[Test build #4420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4420/testReport)** for PR 22893 at commit [`762755d`](https://github.com/apache/spark/commit/762755d7b4196c808323bdcd6ee5db9127610bd1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22893#discussion_r231838390
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -521,19 +521,21 @@ object MLUtils extends Logging {
          * The bound doesn't need the inner product, so we can use it as a sufficient condition to
          * check quickly whether the inner product approach is accurate.
          */
    -    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    -    if (precisionBound1 < precision) {
    -      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    -    } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
    -      val dotValue = dot(v1, v2)
    -      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
    -      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
    -        (sqDist + EPSILON)
    -      if (precisionBound2 > precision) {
    -        sqDist = Vectors.sqdist(v1, v2)
    -      }
    -    } else {
    +    if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
           sqDist = Vectors.sqdist(v1, v2)
    +    } else {
    +      val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    --- End diff --
    
    @srowen Thanks for the review. I will push a new commit and update the related test.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22893#discussion_r232457128
  
    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -88,6 +88,14 @@ def clusterSizes(self):
             """
             return self._call_java("clusterSizes")
     
    +    @property
    --- End diff --
    
    It was my mistake; I have already updated the related files to expect the value "2.0"
    in spark/python/pyspark/ml/clustering.py and spark/python/pyspark/mllib/clustering.py.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > @KyleLi1985 do you have native BLAS installed?
    
    As the code says: // For level-1 routines, we use Java implementation.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org