You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by beckgael <gi...@git.apache.org> on 2018/11/15 14:23:59 UTC

[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...

GitHub user beckgael opened a pull request:

    https://github.com/apache/spark/pull/23048

    transform DenseVector x DenseVector sqdist from imperativ to function…

    …al tailrec style which improves performance well on low and medium dim vectors
    
    ## What changes were proposed in this pull request?
    
    Improving speed efficiency Densevector X Densevector sqdist function
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    Both version, imperativ and functional one have been tested separetely showing large improvment in speed for dimension 2, 3, 10, 100 when the performances tend to be the same on very high dim (10000, 100000)
    
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/beckgael/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23048
    
----
commit 5194a6942425ed9866774e1f5eccc66f5452d745
Author: beckgael <be...@...>
Date:   2018-11-15T14:16:45Z

    transform DenseVector x DenseVector sqdist from imperatif to functional tailrec style which improves performance well on low and medium dim vectors

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23048: transform DenseVector x DenseVector sqdist from imperati...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23048
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23048#discussion_r235031986
  
    --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala ---
    @@ -370,14 +370,19 @@ object Vectors {
           case (v1: DenseVector, v2: SparseVector) =>
             squaredDistance = sqdist(v2, v1)
     
    -      case (DenseVector(vv1), DenseVector(vv2)) =>
    -        var kv = 0
    +      case (DenseVector(vv1), DenseVector(vv2)) => {
             val sz = vv1.length
    -        while (kv < sz) {
    -          val score = vv1(kv) - vv2(kv)
    -          squaredDistance += score * score
    -          kv += 1
    +        @annotation.tailrec
    --- End diff --
    
    The StackOverflow post points to a reason the two implementations that were being compared there could be different. Here I don't see any branching is saved. 
    
    I can't explain your microbenchmark, which still surprises me. If I had to guess it's because of JIT compilation or something. I do wonder whether the microbenchmark shows an improvement that won't actually materialize in this code. That is, I think you'd have to show that this improves the performance of Spark DenseVectors when used as-is in the Vector class here.
    
    I'm pretty skeptical but stranger things have happened.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23048: transform DenseVector x DenseVector sqdist from imperati...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23048
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...

Posted by beckgael <gi...@git.apache.org>.

Github user beckgael commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23048#discussion_r234937901
  
    --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala ---
    @@ -370,14 +370,19 @@ object Vectors {
           case (v1: DenseVector, v2: SparseVector) =>
             squaredDistance = sqdist(v2, v1)
     
    -      case (DenseVector(vv1), DenseVector(vv2)) =>
    -        var kv = 0
    +      case (DenseVector(vv1), DenseVector(vv2)) => {
             val sz = vv1.length
    -        while (kv < sz) {
    -          val score = vv1(kv) - vv2(kv)
    -          squaredDistance += score * score
    -          kv += 1
    +        @annotation.tailrec
    --- End diff --
    
    Hi, i put it [here](https://github.com/beckgael/functional_sqdist_spark) my experimental process with a SparkNotebook or with sbt project in order to allow others to tests.
    I took the `sqdist` function as it is, make a functional version of it and compare both as best as i can.
    Run the notebook on 100k pair (Imperativ/Functional) of `sqdist` for medium size vectors took less than 1min and results are different enough to consider them i presume, even if it's not the best way to test it and i'm sorry for that.
    After having browsed on StackOverFlow i fall on [this one](https://stackoverflow.com/questions/9168624/why-is-my-scala-tail-recursion-faster-than-the-while-loop) which talks about a difference between JVM version on tailrec.
    
    I hope we'll find something interesting and that i'm not disturbing you on your fantastic work !



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23048#discussion_r234399421
  
    --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala ---
    @@ -370,14 +370,19 @@ object Vectors {
           case (v1: DenseVector, v2: SparseVector) =>
             squaredDistance = sqdist(v2, v1)
     
    -      case (DenseVector(vv1), DenseVector(vv2)) =>
    -        var kv = 0
    +      case (DenseVector(vv1), DenseVector(vv2)) => {
             val sz = vv1.length
    -        while (kv < sz) {
    -          val score = vv1(kv) - vv2(kv)
    -          squaredDistance += score * score
    -          kv += 1
    +        @annotation.tailrec
    --- End diff --
    
    FWIW, I did an experimental by myself before because someone proposed similar change (turning `while` to tail recursive one). There was no virtual difference. If this change is only tail recursion vs `while`, I doubt how much it can improve. I believe the benchmark steps should be clarified.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #23048: transform DenseVector x DenseVector sqdist from i...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23048#discussion_r233866690
  
    --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala ---
    @@ -370,14 +370,19 @@ object Vectors {
           case (v1: DenseVector, v2: SparseVector) =>
             squaredDistance = sqdist(v2, v1)
     
    -      case (DenseVector(vv1), DenseVector(vv2)) =>
    -        var kv = 0
    +      case (DenseVector(vv1), DenseVector(vv2)) => {
             val sz = vv1.length
    -        while (kv < sz) {
    -          val score = vv1(kv) - vv2(kv)
    -          squaredDistance += score * score
    -          kv += 1
    +        @annotation.tailrec
    --- End diff --
    
    Why would tail recursion be faster? I can't really imagine why. A small method can get JITted, but the overhead of the method calls is still got to be much higher than an increment and compare. I'd like to understand how it was benchmarked and what the difference is


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #23048: transform DenseVector x DenseVector sqdist from imperati...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23048
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org