You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/11/01 10:13:42 UTC

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
    > 
    > Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
    > 
    > And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
    > 
    > (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
    
    We use only "Vectors Dense", here is the test file
    [SparkMLlibTest.txt](https://github.com/apache/spark/files/2538030/SparkMLlibTest.txt)
    I extract the relevant part from code and compare the performance, The result show in Vectors Dense situation the sqdist is faster。
    And for End-to-End test, I consider the worst situation, input vector are all dense and the precision is not OK!
    
    And other situation definitive like your saying, if the precision is ok to use sqdist is not a good way.
    So, this part logic should be 
    `
        if (precisionBound1 < precision && v1.isInstanceOf[SparseVector] && v2.isInstanceOf[SparseVector]) 
       {
           sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
        } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
          val dotValue = dot(v1, v2)
          sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
          val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
            (sqDist + EPSILON)
          if (precisionBound2 > precision) {
            sqDist = Vectors.sqdist(v1, v2)
          }
        } else {
          sqDist = Vectors.sqdist(v1, v2)
        }
    `
    only use sqdist to calculate distance when the logic like above
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org