Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2015/12/15 10:42:59 UTC

[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/10306

    [SPARK-8519] [ML] [MLlib] Blockify distance computation in k-means

    Use BLAS Level 3 matrix-matrix multiplications to compute pairwise distance in k-means.
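
The idea can be sketched roughly as follows. Because ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, all pairwise squared distances between a block of points and the centers follow from a single matrix product plus precomputed norms. The sketch below is plain, self-contained Scala; a naive loop stands in for the BLAS `gemm` call, and the names `BlockDistanceSketch`, `matMulTransposed`, `pairwiseSqDist` and `closest` are hypothetical, not taken from the patch.

```scala
object BlockDistanceSketch {
  // Matrices are row-major Array[Array[Double]]:
  // points is n x d, centers is k x d.

  /** Naive stand-in for gemm: returns points * centers^T, an n x k matrix. */
  def matMulTransposed(points: Array[Array[Double]],
                       centers: Array[Array[Double]]): Array[Array[Double]] =
    points.map { p =>
      centers.map { c =>
        var s = 0.0
        var i = 0
        while (i < p.length) { s += p(i) * c(i); i += 1 }
        s
      }
    }

  /** Squared Euclidean distance between every point and every center. */
  def pairwiseSqDist(points: Array[Array[Double]],
                     centers: Array[Array[Double]]): Array[Array[Double]] = {
    val pNorm2 = points.map(p => p.map(v => v * v).sum)   // ||x||^2 per point
    val cNorm2 = centers.map(c => c.map(v => v * v).sum)  // ||c||^2 per center
    val cross = matMulTransposed(points, centers)         // x.c for all pairs
    Array.tabulate(points.length, centers.length) { (i, j) =>
      pNorm2(i) + cNorm2(j) - 2.0 * cross(i)(j)
    }
  }

  /** Index of the nearest center for each point. */
  def closest(points: Array[Array[Double]],
              centers: Array[Array[Double]]): Array[Int] =
    pairwiseSqDist(points, centers).map(row => row.indexOf(row.min))
}
```

With a real BLAS backend the `matMulTransposed` step is the part that a Level 3 `gemm` call would replace; the norm terms are cheap by comparison.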

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-8519

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10306.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10306
    
----
commit 202b578ac966e4d61015dda33f10cf7f2cc76022
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-12-15T07:57:33Z

    Initial draft of blockify distance computation in k-means

commit 46816f3e8999a5a3ba3c0e93406939143ab1267b
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-12-15T09:37:48Z

    clean up code and rename some variables

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10306#discussion_r47819551
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -250,114 +240,142 @@ class KMeans private (
             }
           }
         }
    +
         val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
         logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +
           " seconds.")
     
    -    val active = Array.fill(numRuns)(true)
    -    val costs = Array.fill(numRuns)(0.0)
    -
    -    var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
    +    var costs = 0.0
         var iteration = 0
    -
         val iterationStartTime = System.nanoTime()
    +    val isSparse = data.take(1)(0).vector.isInstanceOf[SparseVector]
     
    -    // Execute iterations of Lloyd's algorithm until all runs have converged
    -    while (iteration < maxIterations && !activeRuns.isEmpty) {
    +    // Execute Lloyd's algorithm until converged or reached the max number of iterations
    +    while (iteration < maxIterations) {
    --- End diff --
    
    Won't the algorithm always run until `maxIterations` with this change? The convergence check inside the loop is unused.
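
A minimal sketch of the loop shape this comment is asking for, assuming a `converged` flag that is both updated inside the loop and tested in the loop condition. It uses one-dimensional data and plain Scala collections rather than RDDs; `LloydLoopSketch`, `step`, `run` and the `epsilon` parameter are hypothetical names for illustration, not the code in the patch.

```scala
object LloydLoopSketch {
  /** One Lloyd step on 1-D data: assign each point to its nearest center, then average. */
  def step(data: Array[Double], centers: Array[Double]): Array[Double] = {
    val sums = Array.fill(centers.length)(0.0)
    val counts = Array.fill(centers.length)(0L)
    data.foreach { x =>
      val best = centers.indices.minBy(j => math.abs(x - centers(j)))
      sums(best) += x
      counts(best) += 1
    }
    // Keep the old center when no point was assigned to it.
    centers.indices.map { j =>
      if (counts(j) > 0) sums(j) / counts(j) else centers(j)
    }.toArray
  }

  /** Returns the final centers and the number of iterations actually executed. */
  def run(data: Array[Double], init: Array[Double],
          maxIterations: Int, epsilon: Double): (Array[Double], Int) = {
    var centers = init.clone()
    var iteration = 0
    var converged = false
    // The flag is part of the loop condition, so the loop can stop early.
    while (iteration < maxIterations && !converged) {
      val updated = step(data, centers)
      converged = centers.indices.forall(j => math.abs(updated(j) - centers(j)) <= epsilon)
      centers = updated
      iteration += 1
    }
    (centers, iteration)
  }
}
```

If the condition were only `iteration < maxIterations`, the same `converged` assignment would have no effect on how long the loop runs.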



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165064242
  
    **[Test build #47809 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47809/consoleFull)** for PR 10306 at commit [`347d3ac`](https://github.com/apache/spark/commit/347d3acfedcfabf8e97e9ccd037a9b268af0f351).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165064565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47809/
    Test PASSed.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165385374
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47912/
    Test PASSed.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10306#discussion_r47763856
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala ---
    @@ -96,14 +96,14 @@ class PowerIterationClusteringSuite extends SparkFunSuite with MLlibTestSparkCon
         }
         val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)
     
    -    val model = new PowerIterationClustering()
    -      .setK(2)
    -      .run(graph)
    -    val predictions = Array.fill(2)(mutable.Set.empty[Long])
    -    model.assignments.collect().foreach { a =>
    -      predictions(a.cluster) += a.id
    -    }
    -    assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
    +//    val model = new PowerIterationClustering()
    --- End diff --
    
    Disabled this test case because it is blocked by [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363).



[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-172558579
  
    @mengxr I found and fixed the misconfiguration in my test environment, thanks!
    With the updated setup, ```gemm``` is about 20-30 times faster than ```axpy/dot``` in the test case below:
    ```Scala
        println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
        val n = 3000
        val count = 10
        val random = new Random()
    
        val a = Vectors.dense(Array.fill(n)(random.nextDouble()))
        val aa = Array.fill(n)(a)
        val b = Vectors.dense(Array.fill(n)(random.nextDouble()))
        val bb = Array.fill(n)(b)
    
        val a1 = new DenseMatrix(n, n, aa.flatMap(_.toArray), true)
        val b1 = new DenseMatrix(n, n, bb.flatMap(_.toArray), false)
        val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]
    
        var total1 = 0.0
    
        // Trial runs
        for (i <- 0 until 10) {
          gemm(2.0, a1, b1, 2.0, c1)
        }
    
        for (i <- 0 until count) {
          val start = System.nanoTime()
          gemm(2.0, a1, b1, 2.0, c1)
          total1 += (System.nanoTime() - start)/1e9
        }
        total1 = total1 / count
        println("gemm elapsed time: = %.3f".format(total1) + " seconds.")
    
        // Trial runs
        for (m <- 0 until 10) {
          for (i <- 0 until n; j <- 0 until n) {
            dot(bb(j), aa(i))
          }
        }
    
        var total2 = 0.0
        for (m <- 0 until count) {
          val start = System.nanoTime()
          for (i <- 0 until n; j <- 0 until n) {
            //      axpy(1.0, bb(j), aa(i))
            dot(bb(j), aa(i))
          }
          total2 += (System.nanoTime() - start)/1e9
        }
        total2 = total2 / count
        println("dot elapsed time: = %.3f".format(total2) + " seconds.")
    ```
    The output is:
    ```
    com.github.fommil.netlib.NativeSystemBLAS
    gemm elapsed time: = 1.022 seconds.
    dot elapsed time: = 29.017 seconds.
    ```
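
As a rough cross-check of the figures above: one `gemm` call on two n x n matrices and the n * n dot products of length n each perform about 2 * n^3 floating-point operations, so the two timings can be compared directly. The snippet below only redoes that arithmetic from the reported numbers; `GemmVsDotArithmetic` is a hypothetical name, and the flop count is an approximation that ignores the scaling of the output matrix.

```scala
object GemmVsDotArithmetic {
  val n = 3000L

  // Approximate work per timed run: 2 * n^3 multiply-adds for both variants.
  val flops: Double = 2.0 * n * n * n

  // Timings copied from the reported output, in seconds.
  val gemmSeconds = 1.022
  val dotSeconds = 29.017

  // Ratio of the two timings (about 28x, inside the quoted 20-30x range).
  val speedup: Double = dotSeconds / gemmSeconds

  // Implied throughput in GFLOP/s for each variant.
  val gemmGflops: Double = flops / gemmSeconds / 1e9
  val dotGflops: Double = flops / dotSeconds / 1e9
}
```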



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165378537
  
    **[Test build #47912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47912/consoleFull)** for PR 10306 at commit [`8f76116`](https://github.com/apache/spark/commit/8f76116a471a54fd0cb6331ebf8eeccc5764a23e).



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165385198
  
    **[Test build #47912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47912/consoleFull)** for PR 10306 at commit [`8f76116`](https://github.com/apache/spark/commit/8f76116a471a54fd0cb6331ebf8eeccc5764a23e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-164707603
  
    **[Test build #47724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47724/consoleFull)** for PR 10306 at commit [`46816f3`](https://github.com/apache/spark/commit/46816f3e8999a5a3ba3c0e93406939143ab1267b).



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by 3ourroom <gi...@git.apache.org>.

Github user 3ourroom commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-164703972
  
    
    NAVER - http://www.naver.com/
    --------------------------------------------
    
    The mail you sent to 3ourroom@naver.com <[spark] [SPARK-8519] [ML] [MLlib] Blockify distance computation in k-means (#10306)> could not be delivered for the following reason.
    
    --------------------------------------------
    
    The recipient has blocked mail from your address.
    
    
    --------------------------------------------




[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165054086
  
    **[Test build #47809 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47809/consoleFull)** for PR 10306 at commit [`347d3ac`](https://github.com/apache/spark/commit/347d3acfedcfabf8e97e9ccd037a9b268af0f351).



[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang closed the pull request at:

    https://github.com/apache/spark/pull/10306



[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-171942083
  
    @mengxr Thanks for the hint. I will check my environment and re-run the test.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165064560
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10306#discussion_r47878080
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -250,114 +240,142 @@ class KMeans private (
             }
           }
         }
    +
         val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
         logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +
           " seconds.")
     
    -    val active = Array.fill(numRuns)(true)
    -    val costs = Array.fill(numRuns)(0.0)
    -
    -    var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
    +    var costs = 0.0
         var iteration = 0
    -
         val iterationStartTime = System.nanoTime()
    +    val isSparse = data.take(1)(0).vector.isInstanceOf[SparseVector]
     
    -    // Execute iterations of Lloyd's algorithm until all runs have converged
    -    while (iteration < maxIterations && !activeRuns.isEmpty) {
    +    // Execute Lloyd's algorithm until converged or reached the max number of iterations
    +    while (iteration < maxIterations) {
    --- End diff --
    
    @sethah Thanks for catching this!



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-165385372
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-164717776
  
    **[Test build #47724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47724/consoleFull)** for PR 10306 at commit [`46816f3`](https://github.com/apache/spark/commit/46816f3e8999a5a3ba3c0e93406939143ab1267b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-172562607
  
    @mengxr I have a new, improved implementation for this issue at #10806; let's move the discussion there. I will close this PR now.



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10306#discussion_r47878122
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -250,114 +240,142 @@ class KMeans private (
             }
           }
         }
    +
         val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
         logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +
           " seconds.")
     
    -    val active = Array.fill(numRuns)(true)
    -    val costs = Array.fill(numRuns)(0.0)
    -
    -    var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
    +    var costs = 0.0
         var iteration = 0
    -
         val iterationStartTime = System.nanoTime()
    +    val isSparse = data.take(1)(0).vector.isInstanceOf[SparseVector]
     
    -    // Execute iterations of Lloyd's algorithm until all runs have converged
    -    while (iteration < maxIterations && !activeRuns.isEmpty) {
    +    // Execute Lloyd's algorithm until converged or reached the max number of iterations
    +    while (iteration < maxIterations) {
           type WeightedPoint = (Vector, Long)
           def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = {
             axpy(1.0, x._1, y._1)
             (y._1, x._2 + y._2)
           }
     
    -      val activeCenters = activeRuns.map(r => centers(r)).toArray
    -      val costAccums = activeRuns.map(_ => sc.accumulator(0.0))
    -
    -      val bcActiveCenters = sc.broadcast(activeCenters)
    +      val costAccums = sc.accumulator(0.0)
    +      val bcCenters = sc.broadcast(centers)
     
           // Find the sum and count of points mapping to each center
           val totalContribs = data.mapPartitions { points =>
    -        val thisActiveCenters = bcActiveCenters.value
    -        val runs = thisActiveCenters.length
    -        val k = thisActiveCenters(0).length
    -        val dims = thisActiveCenters(0)(0).vector.size
    +        val thisCenters = bcCenters.value
    +        val k = thisCenters.length
    +        val dims = thisCenters(0).vector.size
    +
    +        val sums = Array.fill(k)(Vectors.zeros(dims))
    +        val counts = Array.fill(k)(0L)
     
    -        val sums = Array.fill(runs, k)(Vectors.zeros(dims))
    -        val counts = Array.fill(runs, k)(0L)
    +        val vectorOfPoints = new ArrayBuffer[Vector]()
    +        val normOfPoints = new ArrayBuffer[Double]()
    +        var numRows = 0
     
    +        // Construct points matrix
             points.foreach { point =>
    -          (0 until runs).foreach { i =>
    -            val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)
    -            costAccums(i) += cost
    -            val sum = sums(i)(bestCenter)
    -            axpy(1.0, point.vector, sum)
    -            counts(i)(bestCenter) += 1
    +          vectorOfPoints.append(point.vector)
    +          normOfPoints.append(point.norm)
    +          numRows += 1
    +        }
    +
    +        val pointMatrix = if (isSparse) {
    +          val coo = new ArrayBuffer[(Int, Int, Double)]()
    +          vectorOfPoints.zipWithIndex.foreach { v =>
    +            val sv = v._1.asInstanceOf[SparseVector]
    +            sv.indices.indices.foreach { i =>
    +              coo.append((v._2, sv.indices(i), sv.values(i)))
    +            }
               }
    +          SparseMatrix.fromCOO(numRows, dims, coo.toSeq)
    +        } else {
    +          new DenseMatrix(numRows, dims, vectorOfPoints.flatMap(_.toArray).toArray, true)
             }
     
    -        val contribs = for (i <- 0 until runs; j <- 0 until k) yield {
    -          ((i, j), (sums(i)(j), counts(i)(j)))
    +        // Construct centers matrix
    +        val vectorOfCenters = new ArrayBuffer[Double]()
    +        val normOfCenters = new ArrayBuffer[Double]()
    +        thisCenters.foreach { center =>
    +          vectorOfCenters.appendAll(center.vector.toArray)
    +          normOfCenters.append(center.norm)
    +        }
    +        val centerMatrix = new DenseMatrix(dims, k, vectorOfCenters.toArray)
    +
    +        val a2b2 = new ArrayBuffer[Double]()
    +        val normOfPointsArray = normOfPoints.toArray
    +        val normOfCentersArray = normOfCenters.toArray
    +        for (i <- 0 until k; j <- 0 until numRows) {
    +          a2b2.append(normOfPointsArray(j) * normOfPointsArray(j) +
    +            normOfCentersArray(i) * normOfCentersArray(i))
    +        }
    +
    +        val distanceMatrix = new DenseMatrix(numRows, k, a2b2.toArray)
    +        gemm(-2.0, pointMatrix, centerMatrix, 1.0, distanceMatrix)
    +
    +        val vectorOfPointsArray = vectorOfPoints.toArray
    +        distanceMatrix.transpose.toArray.grouped(k).toArray.map(_.zipWithIndex.min).zipWithIndex
    --- End diff --
    
    Good point!



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10306#discussion_r47820060
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -250,114 +240,142 @@ class KMeans private (
             }
           }
         }
    +
         val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
         logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +
           " seconds.")
     
    -    val active = Array.fill(numRuns)(true)
    -    val costs = Array.fill(numRuns)(0.0)
    -
    -    var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
    +    var costs = 0.0
         var iteration = 0
    -
         val iterationStartTime = System.nanoTime()
    +    val isSparse = data.take(1)(0).vector.isInstanceOf[SparseVector]
     
    -    // Execute iterations of Lloyd's algorithm until all runs have converged
    -    while (iteration < maxIterations && !activeRuns.isEmpty) {
    +    // Execute Lloyd's algorithm until converged or reached the max number of iterations
    +    while (iteration < maxIterations) {
           type WeightedPoint = (Vector, Long)
           def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = {
             axpy(1.0, x._1, y._1)
             (y._1, x._2 + y._2)
           }
     
    -      val activeCenters = activeRuns.map(r => centers(r)).toArray
    -      val costAccums = activeRuns.map(_ => sc.accumulator(0.0))
    -
    -      val bcActiveCenters = sc.broadcast(activeCenters)
    +      val costAccums = sc.accumulator(0.0)
    +      val bcCenters = sc.broadcast(centers)
     
           // Find the sum and count of points mapping to each center
           val totalContribs = data.mapPartitions { points =>
    -        val thisActiveCenters = bcActiveCenters.value
    -        val runs = thisActiveCenters.length
    -        val k = thisActiveCenters(0).length
    -        val dims = thisActiveCenters(0)(0).vector.size
    +        val thisCenters = bcCenters.value
    +        val k = thisCenters.length
    +        val dims = thisCenters(0).vector.size
    +
    +        val sums = Array.fill(k)(Vectors.zeros(dims))
    +        val counts = Array.fill(k)(0L)
     
    -        val sums = Array.fill(runs, k)(Vectors.zeros(dims))
    -        val counts = Array.fill(runs, k)(0L)
    +        val vectorOfPoints = new ArrayBuffer[Vector]()
    +        val normOfPoints = new ArrayBuffer[Double]()
    +        var numRows = 0
     
    +        // Construct points matrix
             points.foreach { point =>
    -          (0 until runs).foreach { i =>
    -            val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)
    -            costAccums(i) += cost
    -            val sum = sums(i)(bestCenter)
    -            axpy(1.0, point.vector, sum)
    -            counts(i)(bestCenter) += 1
    +          vectorOfPoints.append(point.vector)
    +          normOfPoints.append(point.norm)
    +          numRows += 1
    +        }
    +
    +        val pointMatrix = if (isSparse) {
    +          val coo = new ArrayBuffer[(Int, Int, Double)]()
    +          vectorOfPoints.zipWithIndex.foreach { v =>
    +            val sv = v._1.asInstanceOf[SparseVector]
    +            sv.indices.indices.foreach { i =>
    +              coo.append((v._2, sv.indices(i), sv.values(i)))
    +            }
               }
    +          SparseMatrix.fromCOO(numRows, dims, coo.toSeq)
    +        } else {
    +          new DenseMatrix(numRows, dims, vectorOfPoints.flatMap(_.toArray).toArray, true)
             }
     
    -        val contribs = for (i <- 0 until runs; j <- 0 until k) yield {
    -          ((i, j), (sums(i)(j), counts(i)(j)))
    +        // Construct centers matrix
    +        val vectorOfCenters = new ArrayBuffer[Double]()
    +        val normOfCenters = new ArrayBuffer[Double]()
    +        thisCenters.foreach { center =>
    +          vectorOfCenters.appendAll(center.vector.toArray)
    +          normOfCenters.append(center.norm)
    +        }
    +        val centerMatrix = new DenseMatrix(dims, k, vectorOfCenters.toArray)
    +
    +        val a2b2 = new ArrayBuffer[Double]()
    +        val normOfPointsArray = normOfPoints.toArray
    +        val normOfCentersArray = normOfCenters.toArray
    +        for (i <- 0 until k; j <- 0 until numRows) {
    +          a2b2.append(normOfPointsArray(j) * normOfPointsArray(j) +
    +            normOfCentersArray(i) * normOfCentersArray(i))
    +        }
    +
    +        val distanceMatrix = new DenseMatrix(numRows, k, a2b2.toArray)
    +        gemm(-2.0, pointMatrix, centerMatrix, 1.0, distanceMatrix)
    +
    +        val vectorOfPointsArray = vectorOfPoints.toArray
    +        distanceMatrix.transpose.toArray.grouped(k).toArray.map(_.zipWithIndex.min).zipWithIndex
    --- End diff --
    
    Given the complexity of this line, it might be helpful to unpack the tuple and give semantic meaning to its elements. Something like:
    
    ```scala
    distanceMatrix.transpose.toArray.grouped(k).toArray.map(_.zipWithIndex.min).zipWithIndex
              .foreach { case ((cost, center), i) =>
                costAccums += cost
                val sum = sums(center)
                axpy(1.0, vectorOfPointsArray(i), sum)
                counts(center) += 1
              }
    ```
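
The destructuring suggested above works because of two things: the blockified distance uses the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 (a . b), and `zipWithIndex.min` on `(Double, Int)` pairs compares the distance first, so each result is a `(cost, center)` pair. Below is a minimal sketch of that flow in plain Scala, without Spark or BLAS. The object `ClosestCenterSketch` and its helpers are hypothetical names chosen for illustration; they are not part of the pull request.

```scala
// Hypothetical, Spark-free illustration; not code from the pull request.
object ClosestCenterSketch {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  /** Squared distances with one row per point and one column per center,
    * computed as |a|^2 + |b|^2 - 2 * (a dot b). */
  def squaredDistances(
      points: Array[Array[Double]],
      centers: Array[Array[Double]]): Array[Array[Double]] = {
    val pointNorms2 = points.map(p => dot(p, p))
    val centerNorms2 = centers.map(c => dot(c, c))
    Array.tabulate(points.length, centers.length) { (i, j) =>
      pointNorms2(i) + centerNorms2(j) - 2.0 * dot(points(i), centers(j))
    }
  }

  /** For each point, the (cost, center) pair of its nearest center.
    * Tuples are ordered by their first element, so min picks the smallest cost. */
  def closest(distances: Array[Array[Double]]): Array[(Double, Int)] =
    distances.map(_.zipWithIndex.min)

  def main(args: Array[String]): Unit = {
    val points = Array(Array(0.0, 0.0), Array(4.0, 4.0), Array(1.0, 0.0))
    val centers = Array(Array(0.0, 1.0), Array(5.0, 4.0))
    closest(squaredDistances(points, centers)).zipWithIndex.foreach {
      case ((cost, center), i) =>
        println(s"point $i -> center $center (squared distance $cost)")
    }
  }
}
```

Naming the tuple elements `cost`, `center`, and `i` in the pattern match is what makes the accumulation step readable; the positional `_1`/`_2` accessors would hide which value is the distance and which is the index.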



[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by 3ourroom <gi...@git.apache.org>.

Github user 3ourroom commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-164711591
  
    
    NAVER - http://www.naver.com/
    --------------------------------------------
    
    Your mail <Re: [spark] [SPARK-8519] [ML] [MLlib] Blockify distance computation in k-means (#10306)> to 3ourroom@naver.com could not be delivered for the following reason.
    
    --------------------------------------------
    
    The recipient has blocked mail from you.
    
    
    --------------------------------------------




[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-164717930
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47724/



[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-171834736
  
    Regarding your local performance test:
    
    1. Make sure an optimized BLAS is installed on your system and loaded correctly in the JVM via netlib-java. The difference should be significant at 3000x3000 (with or without multi-threading).
    2. Your tests of GEMM and AXPY are not equivalent. First, they do not use the same matrices for multiplication. Second, `axpy(1.0, bb(j), aa(j))` should be `axpy(1.0, bb(j), aa(i))`. Otherwise, you get some benefit from local caching.
    
    Could you re-run the test? I will take a look at your implementation.
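
The test code being discussed is not included in this thread, so the following is only a hypothetical sketch of a comparison that respects the two points above: both kernels read the same input matrices, and the AXPY destination is indexed as `aa(i)`. It is written in plain Scala, so it measures neither netlib-java nor a native BLAS; a real benchmark would substitute BLAS `gemm` and `axpy` calls. All names (`GemmVsAxpySketch`, `axpyKernel`, `gemmKernel`, `time`) are assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical benchmark layout; the test under discussion is not shown in
// this thread, so every name here is an assumption made for illustration.
object GemmVsAxpySketch {
  /** y += a * x (the BLAS Level 1 operation), written in plain Scala. */
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    var t = 0
    while (t < x.length) { y(t) += a * x(t); t += 1 }
  }

  /** n * n axpy calls on rows of length n. The destination is row i, not row j. */
  def axpyKernel(
      aa: Array[Array[Double]],
      bb: Array[Array[Double]]): Array[Array[Double]] = {
    val out = aa.map(_.clone) // copy so both kernels see identical inputs
    for (i <- out.indices; j <- bb.indices) axpy(1.0, bb(j), out(i))
    out
  }

  /** Naive stand-in for GEMM: C = A * B, also n^3 multiply-adds for n x n inputs. */
  def gemmKernel(
      aa: Array[Array[Double]],
      bb: Array[Array[Double]]): Array[Array[Double]] = {
    val c = Array.ofDim[Double](aa.length, bb(0).length)
    for (i <- aa.indices; p <- bb.indices; j <- bb(0).indices) {
      c(i)(j) += aa(i)(p) * bb(p)(j)
    }
    c
  }

  /** Runs body once and returns its result with the elapsed milliseconds. */
  def time[T](body: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    val n = 200
    val rng = new Random(42)
    val aa = Array.fill(n, n)(rng.nextDouble())
    val bb = Array.fill(n, n)(rng.nextDouble())
    val (_, axpyMs) = time(axpyKernel(aa, bb))
    val (_, gemmMs) = time(gemmKernel(aa, bb))
    println(f"axpy loop: $axpyMs%.1f ms, naive gemm: $gemmMs%.1f ms")
  }
}
```

A single timed run like this is noisy on the JVM; repeating each kernel several times and discarding the first warm-up iterations would give steadier numbers.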
    




[GitHub] spark pull request: [SPARK-8519] [ML] [MLlib] Blockify distance co...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-164717926
  
    Merged build finished. Test FAILed.

