Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/10/31 10:25:25 UTC

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    End-to-End Test:
    The code below was used to run the test:
    ```scala
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.clustering.KMeans.K_MEANS_PARALLEL
    import org.apache.spark.mllib.linalg.Vectors

    test("kmeanproblem") {
      // Parse one comma-separated feature vector per input line.
      val rdd = sc
        .textFile("/Users/liliang/Desktop/inputdata.txt")
        .map(line => line.split(",").map(_.toDouble))

      val vectorRdd = rdd.map(f => Vectors.dense(f))
      val startTime = System.currentTimeMillis()
      // Train 20 models so per-run noise is amortized in the timing.
      for (i <- 0 until 20) {
        val model = new KMeans()
          .setK(8)
          .setMaxIterations(100)
          .setInitializationMode(K_MEANS_PARALLEL)
          .run(vectorRdd)
      }
      val endTime = System.currentTimeMillis()

      // scalastyle:off println
      println("cost time: " + (endTime - startTime))
      // scalastyle:on println
    }
    ```
    Input Data:
    57216 records were extracted from the EEG Steady-State Visual Evoked Potential Signals dataset (http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals) to form the test input.
    Test Result:
    Before the patch: 297686 milliseconds (worst case observed)
    After the patch: 180544 milliseconds (worst case observed)
    
    Function Test:
    Only the fastSquaredDistance function was tested: it was called 100000000 times each before and after the patch (a sketch of such a harness follows the test result below).
    
    Input Data:
    1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
    4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
    Test Result:
    Before the patch: 8395 milliseconds
    After the patch: 5448 milliseconds
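
    For reference, a minimal sketch of such a micro-benchmark (the object name and loop structure are my assumptions, not the harness actually used; it must be compiled under org.apache.spark.mllib because fastSquaredDistance is private[mllib], and the two vectors are the input rows above):
    ```scala
    package org.apache.spark.mllib.util

    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical harness: times repeated calls to MLUtils.fastSquaredDistance
    // on the two fixed input vectors listed above.
    object FastSquaredDistanceBench {
      def main(args: Array[String]): Unit = {
        val v1 = Vectors.dense(Array(1, 2, 3, 4, 3, 4, 5, 6, 7, 8, 9, 0, 1, 3, 4,
          6, 7, 4, 2, 2, 5, 7, 8, 9, 3, 2, 3, 5, 7, 8, 9, 3, 3, 2, 1, 1, 2, 2, 9,
          3, 3, 4, 5).map(_.toDouble))
        val v2 = Vectors.dense(Array(4, 5, 2, 1, 5, 6, 3, 2, 1, 3, 4, 6, 7, 8, 9,
          0, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8, 5, 3, 2, 1, 4, 5, 6, 7, 8, 4, 3, 2, 4,
          6, 7, 8, 9).map(_.toDouble))
        // fastSquaredDistance takes the precomputed L2 norms of both vectors.
        val norm1 = Vectors.norm(v1, 2.0)
        val norm2 = Vectors.norm(v2, 2.0)

        val start = System.currentTimeMillis()
        var i = 0
        while (i < 100000000) {
          MLUtils.fastSquaredDistance(v1, norm1, v2, norm2)
          i += 1
        }
        val end = System.currentTimeMillis()
        println("cost time: " + (end - start) + " milliseconds")
      }
    }
    ```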
    
    According to the tests above, we can conclude that the patch improves the performance of the fastSquaredDistance function in Spark's k-means implementation.
    (Furthermore, sqDist = Vectors.sqdist(v1, v2) outperforms
    sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    in computation; see the sketch below.)
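
    To see the two formulas side by side, here is my own standalone sketch (not the patch itself; BLAS.dot is private to Spark, so the dot product is computed by hand). Both compute the same quantity, but sqdist makes a single pass over the elements, while the norm-based form needs the dot product plus the precomputed norms and can also lose precision by subtracting two nearly equal terms when the vectors are close:
    ```scala
    import org.apache.spark.mllib.linalg.Vectors

    object SqdistVsNorms {
      def main(args: Array[String]): Unit = {
        val v1 = Vectors.dense(1.0, 2.0, 3.0)
        val v2 = Vectors.dense(1.0 + 1e-8, 2.0, 3.0)

        // One-pass formula over the elements, as Vectors.sqdist uses.
        val direct = Vectors.sqdist(v1, v2)

        // Norm-based formula: ||v1||^2 + ||v2||^2 - 2 * (v1 . v2).
        val norm1 = Vectors.norm(v1, 2.0)
        val norm2 = Vectors.norm(v2, 2.0)
        val dot = v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum
        val viaNorms = norm1 * norm1 + norm2 * norm2 - 2.0 * dot

        println(s"sqdist:    $direct")
        // May differ from the value above due to catastrophic cancellation.
        println(s"via norms: $viaNorms")
      }
    }
    ```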


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org