You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/10/30 11:26:14 UTC
[GitHub] spark pull request #22893: One part of Spark MLlib Kmean Logic Performance p...
GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/22893
One part of Spark MLlib Kmean Logic Performance problem
[SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem
## What changes were proposed in this pull request?
Reduce low performance logic in function fastSquaredDistance
## How was this patch tested?
./dev/run-tests pass
Calculate two vector for test fastSquaredDistance 100000000 times
1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
After added patch, the cost time update from 8395 to 5448 milliseconds
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/KyleLi1985/spark updatekmeanpatch
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22893.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22893
----
commit 701223b39a0c473de865de30b0017af4883441f3
Author: 李亮 <li...@...>
Date: 2018-10-30T11:03:02Z
upgrade kmean performance
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22893
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
There's no merge conflict right now. You can just update the file and push the commit to your branch. If there were a merge conflict, you'd just rebase on apache/master, resolve the conflict, and force-push the branch.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
Heh, as a side effect, this made the output of computeCost more accurate in one Pyspark test. It prints "2.0" rather than "2.000..." I think you can change the three instances that failed to just expect "2.0".
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
I form the final test case for sparse case and dense case on realistic data to test new commit
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2561442/SparkMLlibTest.txt)
**For Dense case**, we use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
and use all the dense data file from this realistic data
**Dense case Test Result time cost (milliseconds)**
Before Enhance 211878 210845 215375
After Enhance 140827 149282 130691
**For Sparse case**, we use data from
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract all the sparse data file (PS1, PS2, PS3, PS4, PS5, PS6) from this realistic data
**Sparse case Test Result time cost (milliseconds)**
Before Enhance 108080 103582 103586
After Enhance 107652 107145 104768
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22893
then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22893
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22893
**[Test build #4420 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4420/testReport)** for PR 22893 at commit [`762755d`](https://github.com/apache/spark/commit/762755d7b4196c808323bdcd6ee5db9127610bd1).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
There is my test for sparse-sparse, dense-dense, sparse-dense case
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2541007/SparkMLlibTest.txt)
`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
import com.github.fommil.netlib.F2jBLAS
import java.util.Date
object SparkMLlibTest {
class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
def this(array: Array[Double]) = this(Vectors.dense(array))
/** Converts the vector to a dense vector. */
def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
}
def dot(x: Vector, y: Vector): Double = {
require(x.size == y.size,
"BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
" x.size = " + x.size + ", y.size = " + y.size)
(x, y) match {
case (dx: DenseVector, dy: DenseVector) =>
dot(dx, dy)
case (sx: SparseVector, dy: DenseVector) =>
dot(sx, dy)
case (dx: DenseVector, sy: SparseVector) =>
dot(sy, dx)
case (sx: SparseVector, sy: SparseVector) =>
dot(sx, sy)
case _ =>
throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
}
}
def dot(x: DenseVector, y: DenseVector): Double = {
val n = x.size
new F2jBLAS().ddot(n, x.values, 1, y.values, 1)
}
def dot(x: SparseVector, y: DenseVector): Double = {
val xValues = x.values
val xIndices = x.indices
val yValues = y.values
val nnz = xIndices.length
var sum = 0.0
var k = 0
while (k < nnz) {
sum += xValues(k) * yValues(xIndices(k))
k += 1
}
sum
}
/**
* dot(x, y)
*/
def dot(x: SparseVector, y: SparseVector): Double = {
val xValues = x.values
val xIndices = x.indices
val yValues = y.values
val yIndices = y.indices
val nnzx = xIndices.length
val nnzy = yIndices.length
var kx = 0
var ky = 0
var sum = 0.0
// y catching x
while (kx < nnzx && ky < nnzy) {
val ix = xIndices(kx)
while (ky < nnzy && yIndices(ky) < ix) {
ky += 1
}
if (ky < nnzy && yIndices(ky) == ix) {
sum += xValues(kx) * yValues(ky)
ky += 1
}
kx += 1
}
sum
}
lazy val EPSILON = {
var eps = 1.0
while ((1.0 + (eps / 2.0)) != 1.0) {
eps /= 2.0
}
eps
}
def fastSquaredDistanceAddedPatch(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
val normDiff = norm1 - norm2
var sqDist = 0.0
/*
* The relative error is
* <pre>
* EPSILON * ( \|a\|_2^2 + \|b\\_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
* </pre>
* which is bounded by
* <pre>
* 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
* </pre>
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
|| !v2.isInstanceOf[DenseVector])) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
sqDist
}
def fastSquaredDistanceOriginal(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
val normDiff = norm1 - norm2
var sqDist = 0.0
/*
* The relative error is
* <pre>
* EPSILON * ( \|a\|_2^2 + \|b\\_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
* </pre>
* which is bounded by
* <pre>
* 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
* </pre>
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
if (precisionBound1 < precision) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
sqDist
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Test")
conf.setMaster("local[*]")
conf.set("spark.testing.memory", "512000000")
val sc = new SparkContext(conf)
val testData1 = sc.parallelize(List(Array(0,0,2,3,4,1,3,5,3,2,1,5,0,5,0,0,3,0,1,2,3,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,2)),1)
val testData2 = sc.parallelize(List(Array(1,2,3,0,2,3,0,3,5,7,2,1,3,2,0,3,2,1,0,4,0,1,0,1,0,1,0,0,0,0,0,2,0,0,0,1,0)),1)
val vector1dense = testData1.map{
f =>
val vector = Vectors.dense(f.map(f=>f.toDouble))
new VectorWithNorm(vector, Vectors.norm(vector,2))
}.take(1).apply(0)
val vector2dense = testData2.map{
f =>
val vector = Vectors.dense(f.map(f=>f.toDouble))
new VectorWithNorm(vector, Vectors.norm(vector,2))
}.take(1).apply(0)
val vector1sparse = testData1.map{
f =>
val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
val spasevector = vector.toSparse
new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
}.take(1).apply(0)
val vector2sparse = testData2.map{
f =>
val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
val spasevector = vector.toSparse
new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
}.take(1).apply(0)
var vector1 = Vectors.zeros(0)
var vector2 = Vectors.zeros(0)
var norm1 = 0.0
var norm2 = 0.0
//use different input data mode to check before and after add patch
//INPUT mode:
//sparse/sparseCase
//dense/denseCase
//dense/sparseCase
val key = "dense/sparseCase"
key match {
case "sparse/sparseCase" =>
vector1 = vector1sparse.vector
vector2 = vector2sparse.vector
norm1 = vector1sparse.norm
norm2 = vector2sparse.norm
case "dense/denseCase" =>
vector1 = vector1dense.vector
vector2 = vector2dense.vector
norm1 = vector1dense.norm
norm2 = vector2dense.norm
case "dense/sparseCase" =>
vector1 = vector1sparse.vector
vector2 = vector2dense.vector
norm1 = vector1sparse.norm
norm2 = vector2dense.norm
}
var sqOri = 0.0
val startTime1 = new Date().getTime
for (i <- 0 until(100000000)) {//100000000
sqOri = fastSquaredDistanceAddedPatch(vector1,norm1,vector2,norm2)
}
val endTime1 = new Date().getTime
println("sqOri: " + sqOri)
println("costTime1: " + (endTime1 - startTime1))
var sqAP = 0.0
val startTime2 = new Date().getTime
for (a <- 0 until(100000000)) {
sqAP = fastSquaredDistanceAddedPatch(vector1,norm1,vector2,norm2)
}
val endTime2 = new Date().getTime
println("sqAp: " + sqAP)
println("costTime2: " + (endTime2 - startTime2))
sc.stop()
}
}
`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.
@srowen @HyukjinKwon @mgaido91 Thanks for review. It is my pleasure.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22893
**[Test build #4424 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4424/testReport)** for PR 22893 at commit [`50ef296`](https://github.com/apache/spark/commit/50ef296e4df865105188f00e09be758187d0fbe9).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
>
> Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
>
> And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
>
> (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
We use only "Vectors Dense", here is the test file
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2538356/SparkMLlibTest.txt)
I extract the relevant part from code and compare the performance, The result show in Vectors Dense situation the sqdist is faster。
And for End-to-End test, I consider the worst situation, input vector are all dense and the precision is not OK!
`
if (precisionBound1 < precision && !v1.isInstanceOf[DenseVector]
&& !v2.isInstanceOf[DenseVector]) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
`
only use sqdist to calculate distance when the logic is presented above
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22893
**[Test build #4424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4424/testReport)** for PR 22893 at commit [`50ef296`](https://github.com/apache/spark/commit/50ef296e4df865105188f00e09be758187d0fbe9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22893
Also please fill the PR description. How much performance does it gain?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.
Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.
How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?
This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
End-to-End TEST Situation:
Use below code to test
`
test("kmeanproblem") {
val rdd = sc
.textFile("/Users/liliang/Desktop/inputdata.txt")
.map(f => f.split(",")
.map(f => f.toDouble))
val vectorRdd = rdd.map(f => Vectors.dense(f))
val startTime = System.currentTimeMillis()
for (i <- 0 until 20) {
val model = new KMeans()
.setK(8)
.setMaxIterations(100)
.setInitializationMode(K_MEANS_PARALLEL)
.run(vectorRdd)
}
val endTime = System.currentTimeMillis()
// scalastyle:off println
println("cost time: " + (endTime - startTime))
// scalastyle:on println
`
Input Data:
extract 57216 items from the data (http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals) to form the test input data
Test Result:
Before: cost time is 297686 milliseconds (consider the worst situation)
After add patch: cost time is 180544 milliseconds (consider the worst situation)
Function Test Situation:
Only test function fastSquaredDistance in below situation:
call fastSquaredDistance function 100000000 times before and after added patch respectively
Input Data:
1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
Test Result:
Before: cost time is 8395 milliseconds
After added patch: cost time is 5448 milliseconds
So according to above test, we can conclude that the patch give a better performance for function fastSquaredDistance in spark k-mean mode.
( further more the sqDist = Vectors.sqdist(v1, v2)
is better than
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
in calculation performance
)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
It seems the related file spark/python/pyspark/ml/clustering.py has been changed, during these days. My local latest commit stay on "bfe60fc on 30 Jul". So I need re-fork spark and open another pull request, or is there other method?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
There is my test for situation sparse-sparse, dense-dense, sparse-dense case
`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
import com.github.fommil.netlib.F2jBLAS
import java.util.Date
object SparkMLlibTest {
class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
def this(array: Array[Double]) = this(Vectors.dense(array))
/** Converts the vector to a dense vector. */
def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
}
def dot(x: Vector, y: Vector): Double = {
require(x.size == y.size,
"BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
" x.size = " + x.size + ", y.size = " + y.size)
(x, y) match {
case (dx: DenseVector, dy: DenseVector) =>
dot(dx, dy)
case (sx: SparseVector, dy: DenseVector) =>
dot(sx, dy)
case (dx: DenseVector, sy: SparseVector) =>
dot(sy, dx)
case (sx: SparseVector, sy: SparseVector) =>
dot(sx, sy)
case _ =>
throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
}
}
def dot(x: DenseVector, y: DenseVector): Double = {
val n = x.size
new F2jBLAS().ddot(n, x.values, 1, y.values, 1)
}
def dot(x: SparseVector, y: DenseVector): Double = {
val xValues = x.values
val xIndices = x.indices
val yValues = y.values
val nnz = xIndices.length
var sum = 0.0
var k = 0
while (k < nnz) {
sum += xValues(k) * yValues(xIndices(k))
k += 1
}
sum
}
/**
* dot(x, y)
*/
def dot(x: SparseVector, y: SparseVector): Double = {
val xValues = x.values
val xIndices = x.indices
val yValues = y.values
val yIndices = y.indices
val nnzx = xIndices.length
val nnzy = yIndices.length
var kx = 0
var ky = 0
var sum = 0.0
// y catching x
while (kx < nnzx && ky < nnzy) {
val ix = xIndices(kx)
while (ky < nnzy && yIndices(ky) < ix) {
ky += 1
}
if (ky < nnzy && yIndices(ky) == ix) {
sum += xValues(kx) * yValues(ky)
ky += 1
}
kx += 1
}
sum
}
lazy val EPSILON = {
var eps = 1.0
while ((1.0 + (eps / 2.0)) != 1.0) {
eps /= 2.0
}
eps
}
def fastSquaredDistanceAddedPatch(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
val normDiff = norm1 - norm2
var sqDist = 0.0
/*
* The relative error is
* <pre>
* EPSILON * ( \|a\|_2^2 + \|b\\_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
* </pre>
* which is bounded by
* <pre>
* 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
* </pre>
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
|| !v2.isInstanceOf[DenseVector])) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
sqDist
}
def fastSquaredDistanceOriginal(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
val normDiff = norm1 - norm2
var sqDist = 0.0
/*
* The relative error is
* <pre>
* EPSILON * ( \|a\|_2^2 + \|b\\_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
* </pre>
* which is bounded by
* <pre>
* 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
* </pre>
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
if (precisionBound1 < precision) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
sqDist
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Test")
conf.setMaster("local[*]")
conf.set("spark.testing.memory", "512000000")
val sc = new SparkContext(conf)
val testData1 = sc.parallelize(List(Array(0,0,2,3,4,1,3,5,3,2,1,5,0,5,0,0,3,0,1,2,3,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,2)),1)
val testData2 = sc.parallelize(List(Array(1,2,3,0,2,3,0,3,5,7,2,1,3,2,0,3,2,1,0,4,0,1,0,1,0,1,0,0,0,0,0,2,0,0,0,1,0)),1)
val vector1dense = testData1.map{
f =>
val vector = Vectors.dense(f.map(f=>f.toDouble))
new VectorWithNorm(vector, Vectors.norm(vector,2))
}.take(1).apply(0)
val vector2dense = testData2.map{
f =>
val vector = Vectors.dense(f.map(f=>f.toDouble))
new VectorWithNorm(vector, Vectors.norm(vector,2))
}.take(1).apply(0)
val vector1sparse = testData1.map{
f =>
val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
val spasevector = vector.toSparse
new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
}.take(1).apply(0)
val vector2sparse = testData2.map{
f =>
val vector = Vectors.sparse(f.length, f.zipWithIndex.map(f=>f._2),f.map(f=>f.toDouble))
val spasevector = vector.toSparse
new VectorWithNorm(spasevector, Vectors.norm(spasevector,2))
}.take(1).apply(0)
var vector1 = Vectors.zeros(0)
var vector2 = Vectors.zeros(0)
var norm1 = 0.0
var norm2 = 0.0
//use different input data mode to check before and after add patch
//INPUT mode:
//sparse/sparseCase
//dense/denseCase
//dense/sparseCase
val key = "dense/sparseCase"
key match {
case "sparse/sparseCase" =>
vector1 = vector1sparse.vector
vector2 = vector2sparse.vector
norm1 = vector1sparse.norm
norm2 = vector2sparse.norm
case "dense/denseCase" =>
vector1 = vector1dense.vector
vector2 = vector2dense.vector
norm1 = vector1dense.norm
norm2 = vector2dense.norm
case "dense/sparseCase" =>
vector1 = vector1sparse.vector
vector2 = vector2dense.vector
norm1 = vector1sparse.norm
norm2 = vector2dense.norm
}
var sqOri = 0.0
val startTime1 = new Date().getTime
for (i <- 0 until(100000000)) {//100000000
sqOri = fastSquaredDistanceAddedPatch(vector1,norm1,vector2,norm2)
}
val endTime1 = new Date().getTime
println("sqOri: " + sqOri)
println("costTime1: " + (endTime1 - startTime1))
var sqAP = 0.0
val startTime2 = new Date().getTime
for (a <- 0 until(100000000)) {
sqAP = fastSquaredDistanceAddedPatch(vector1,norm1,vector2,norm2)
}
val endTime2 = new Date().getTime
println("sqAp: " + sqAP)
println("costTime2: " + (endTime2 - startTime2))
sc.stop()
}
}
`
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2541002/SparkMLlibTest.txt)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
(If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.
Ok, For a fair result, I will try it
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.
>
> Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.
>
> How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?
>
> This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
1) why use Spark? not for special reason, only align with my common using tool.
2) About the vector, I did a more representative input test, I show this result below
3) About the precision, it is trick, you can meet your goal (let your calculation logic into which branch) by manually change it. As I said in last comment, take LOGIC2 for example, you can manually change precision to -10000 in ( precisionbound1 < precision) and change precision to 10000 in (precisionbound2 > precision), so you calculation login will into LOGIC2 situation. It is like codecoverage thing. Anyway, we goal is to show the performance will not change in same calculation logic before and after added Enhance for sparse-sparse and sparse-dense situation.
There is my test file
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2544667/SparkMLlibTest.txt)
There is my test data situation
I use the data
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
extract file (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data
total instances are 13230
the attributes for line are 6000
**Result for sparse-sparse situation time cost (milliseconds)**
Before Enhance: 7670, 7704, 7652
After Enhance: 7634, 7729, 7645
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22893
@KyleLi1985 do you have native BLAS installed?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22893
How much performance does it gain in end-to-end test, and how does it provide better performance?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22893
**[Test build #4423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4423/testReport)** for PR 22893 at commit [`dc30bac`](https://github.com/apache/spark/commit/dc30bace482aa129178b5b34f371ff20c8b545ba).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.
>
> This doesn't show an improvement, right?
TEST, I agree with you
No influence for sparse case
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.
This part only use F2j to calculate as I said in last comment, so the performance is not influence by the native BLAS
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@SparkQA test this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.
This doesn't show an improvement, right?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@AmplabJenkins test this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
There is my test for sparse-sparse, dense-dense, sparse-dense case
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2541385/SparkMLlibTest.txt)
There is test result:
First we need define some branch path logic for sparse-sparse and sparse-dense case
if meet precisionBound1, we define it as LOGIC1
if not meet precisionBound1, and not meet precisionBound2, we define it as LOGIC2
if not meet precisionBound1, but meet precisionBound2, we define it as LOGIC3
(There is a trick, you can manually change the precision value to meet above situation)
**sparse- sparse case time cost situation (milliseconds)**
LOGIC1
Before add patch: 7786, 7970, 8086
After add patch: 7729, 7653, 7903
LOGIC2
Before add patch: 8412, 9029, 8606
After add patch: 8603, 8724, 9024
LOGIC3
Before add patch: 19365, 19146, 19351
After add patch: 18917, 19007, 19074
**sparse-dense case time cost situation (milliseconds)**
LOGIC1
Before add patch: 4195, 4014, 4409
After add patch: 4081,3971, 4151
LOGIC2
Before add patch: 4968, 5579, 5080
After add patch: 4980, 5472, 5148
LOGIC3
Before add patch: 11848, 12077, 12168
After add patch: 11718, 11874, 11743
And for dense-dense case like we already discussed in comment, only use sqdist to calculate distance
**dense-dense case time cost situation (milliseconds)**
Before add patch: 7340, 7816, 7672
After add patch: 5752, 5800, 5753
The above result based on comparison between Original fastSquaredDistance and Enhanced fastSquaredDistance which is showed below
`
private[mllib] def fastSquaredDistance(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
val normDiff = norm1 - norm2
var sqDist = 0.0
/*
* The relative error is
* <pre>
* EPSILON * ( \|a\|_2^2 + \|b\\_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
* </pre>
* which is bounded by
* <pre>
* 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
* </pre>
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
sqDist = Vectors.sqdist(v1, v2)
} else {
val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
if (precisionBound1 < precision) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
}
}
sqDist
}
`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
>
> Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
>
> And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
>
> (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
We use only "Vectors Dense", here is the test file
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2538030/SparkMLlibTest.txt)
I extract the relevant part from code and compare the performance, The result show in Vectors Dense situation the sqdist is faster。
And for End-to-End test, I consider the worst situation, input vector are all dense and the precision is not OK!
And other situation definitive like your saying, if the precision is ok to use sqdist is not a good way.
So, this part logic should be
`
if (precisionBound1 < precision && v1.isInstanceOf[SparseVector] && v2.isInstanceOf[SparseVector])
{
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
`
only use sqdist to calculate distance when the logic like above
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22893
Please fix the PR title as described in https://spark.apache.org/contributing.html and read it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r231555559
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -521,19 +521,21 @@ object MLUtils extends Logging {
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
- val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
- if (precisionBound1 < precision) {
- sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
- } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
- val dotValue = dot(v1, v2)
- sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
- val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
- (sqDist + EPSILON)
- if (precisionBound2 > precision) {
- sqDist = Vectors.sqdist(v1, v2)
- }
- } else {
+ if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
sqDist = Vectors.sqdist(v1, v2)
+ } else {
+ val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
--- End diff --
Shouldn't you pull computation of things like normDiff into this block, then? That would improve the dense case more, I suppose.
I'd still like to see your current final test code to understand what it's testing. If it's a win on some types of vectors, on realistic data, and doesn't hurt performance otherwise, this could be OK.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r232456101
--- Diff: python/pyspark/ml/clustering.py ---
@@ -88,6 +88,14 @@ def clusterSizes(self):
"""
return self._call_java("clusterSizes")
+ @property
--- End diff --
Why is this now added, by mistake? this change should just be the one above and the test change
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@SparkQA retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/22893
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
There is above test result:
First we need define some branch path logic for sparse-sparse and sparse-dense case
if meet precisionBound1, we define it as LOGIC1
if not meet precisionBound1, and not meet precisionBound2, we define it as LOGIC2
if not meet precisionBound1, but meet precisionBound2, we define it as LOGIC3
(There is a trick, you can manually change the precision value to meet above situation)
**sparse- sparse case time cost situation (milliseconds)**
LOGIC1
Before add patch: 7786, 7970, 8086
After add patch: 7729, 7653, 7903
LOGIC2
Before add patch: 8412, 9029, 8606
After add patch: 8603, 8724, 9024
LOGIC3
Before add patch: 19365, 19146, 19351
After add patch: 18917, 19007, 19074
**sparse-dense case time cost situation (milliseconds)**
LOGIC1
Before add patch: 4195, 4014, 4409
After add patch: 4081,3971, 4151
LOGIC2
Before add patch: 4968, 5579, 5080
After add patch: 4980, 5472, 5148
LOGIC3
Before add patch: 11848, 12077, 12168
After add patch: 11718, 11874, 11743
And for dense-dense case like we already discussed in comment, only use sqdist to calculate distance
**dense-dense case time cost situation (milliseconds)**
Before add patch: 7340, 7816, 7672
After add patch: 5752, 5800, 5753
The above result based on fastSquaredDistance which is showed below
`
private[mllib] def fastSquaredDistance(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
val normDiff = norm1 - norm2
var sqDist = 0.0
/*
* The relative error is
* <pre>
* EPSILON * ( \|a\|_2^2 + \|b\\_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
* </pre>
* which is bounded by
* <pre>
* 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
* </pre>
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
|| !v2.isInstanceOf[DenseVector])) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
sqDist
}
`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22893
**[Test build #4423 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4423/testReport)** for PR 22893 at commit [`dc30bac`](https://github.com/apache/spark/commit/dc30bace482aa129178b5b34f371ff20c8b545ba).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22893
**[Test build #4420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4420/testReport)** for PR 22893 at commit [`762755d`](https://github.com/apache/spark/commit/762755d7b4196c808323bdcd6ee5db9127610bd1).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r231838390
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -521,19 +521,21 @@ object MLUtils extends Logging {
* The bound doesn't need the inner product, so we can use it as a sufficient condition to
* check quickly whether the inner product approach is accurate.
*/
- val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
- if (precisionBound1 < precision) {
- sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
- } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
- val dotValue = dot(v1, v2)
- sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
- val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
- (sqDist + EPSILON)
- if (precisionBound2 > precision) {
- sqDist = Vectors.sqdist(v1, v2)
- }
- } else {
+ if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
sqDist = Vectors.sqdist(v1, v2)
+ } else {
+ val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
--- End diff --
@srowen Thanks for review, I will update the new commit and related test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r232457128
--- Diff: python/pyspark/ml/clustering.py ---
@@ -88,6 +88,14 @@ def clusterSizes(self):
"""
return self._call_java("clusterSizes")
+ @property
--- End diff --
It is my mistake, already update the related file to fit value "2.0"
in spark/python/pyspark/ml/clustering.py and spark/python/pyspark/mllib/clustering.py
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
Merged to master
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> @KyleLi1985 do you have native BLAS installed?
Like code said : // For level-1 routines, we use Java implementation.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22893
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: One part of Spark MLlib Kmean Logic Performance problem
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22893
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/22893
Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org