Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/10/31 10:25:25 UTC
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
End-to-end test:
The following code was used to run the test:
```scala
test("kmeans problem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt")
    .map(line => line.split(",").map(_.toDouble))
  val vectorRdd = rdd.map(Vectors.dense(_))
  val startTime = System.currentTimeMillis()
  // Run training 20 times to amortize noise in the timing.
  for (i <- 0 until 20) {
    val model = new KMeans()
      .setK(8)
      .setMaxIterations(100)
      .setInitializationMode(K_MEANS_PARALLEL)
      .run(vectorRdd)
  }
  val endTime = System.currentTimeMillis()
  // scalastyle:off println
  println("cost time: " + (endTime - startTime))
  // scalastyle:on println
}
```
Input data:
57216 records extracted from the EEG Steady-State Visual Evoked Potential Signals dataset (http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals) to form the test input.
Test result:
Before the patch: 297686 milliseconds (worst case observed)
After the patch: 180544 milliseconds (worst case observed)
Function-level test:
Only the fastSquaredDistance function was tested, by calling it 100000000 times before and after the patch, respectively.
Input data (two vectors):
1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
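The per-call measurement methodology can be reproduced with a plain Scala harness. This is only a sketch: it times a hand-rolled squared-distance loop over the two vectors listed above, since MLUtils.fastSquaredDistance itself is private to Spark MLlib; the object and method names here are illustrative, not Spark's.

```scala
object DistanceBench {
  // The two 43-element test vectors listed above.
  val v1: Array[Double] = Array(1,2,3,4,3,4,5,6,7,8,9,0,1,3,4,6,7,4,2,2,5,
    7,8,9,3,2,3,5,7,8,9,3,3,2,1,1,2,2,9,3,3,4,5)
  val v2: Array[Double] = Array(4,5,2,1,5,6,3,2,1,3,4,6,7,8,9,0,3,2,1,2,3,
    4,5,6,7,8,5,3,2,1,4,5,6,7,8,4,3,2,4,6,7,8,9)

  // Direct squared Euclidean distance (the same formulation as Vectors.sqdist).
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  def main(args: Array[String]): Unit = {
    val iters = 10000000 // scaled down from the 100000000 calls in the test above
    val start = System.currentTimeMillis()
    var sink = 0.0 // accumulate results so the JIT cannot eliminate the loop
    var i = 0
    while (i < iters) { sink += sqDist(v1, v2); i += 1 }
    val end = System.currentTimeMillis()
    println(s"cost time: ${end - start} ms (checksum: $sink)")
  }
}
```

Absolute timings will differ by machine; only the before/after ratio is meaningful.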
Test result:
Before the patch: 8395 milliseconds
After the patch: 5448 milliseconds
According to the tests above, we can conclude that the patch improves the performance of the fastSquaredDistance function in Spark's k-means.
(Furthermore, sqDist = Vectors.sqdist(v1, v2)
outperforms
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
in computation time.)
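The two formulations are mathematically equivalent via the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b), which is why either can be used inside fastSquaredDistance. A hedged plain-Scala sketch of the equivalence (the helper names are illustrative, not Spark's API):

```scala
object SqDistEquivalence {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Direct form: the formulation Vectors.sqdist uses.
  def sqDistDirect(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Norm-based form: sumSquaredNorm - 2 * dot(v1, v2),
  // where sumSquaredNorm = ||a||^2 + ||b||^2.
  def sqDistByNorms(a: Array[Double], b: Array[Double]): Double =
    dot(a, a) + dot(b, b) - 2.0 * dot(a, b)

  def main(args: Array[String]): Unit = {
    val a = Array(1.0, 2.0, 3.0)
    val b = Array(4.0, 6.0, 8.0)
    println(sqDistDirect(a, b))  // 9 + 16 + 25 = 50.0
    println(sqDistByNorms(a, b)) // 50.0 as well
  }
}
```

The norm-based form pays off when the squared norms are precomputed and cached, but it can lose precision to cancellation when the vectors are close; the direct form makes a single pass with no cancellation, which is consistent with the timing result above.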