You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by johnc1231 <gi...@git.apache.org> on 2017/03/28 15:29:20 UTC

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

GitHub user johnc1231 opened a pull request:

    https://github.com/apache/spark/pull/17459

    [SPARK-20109][MLlib] Added toBlockMatrixDense to IndexedRowMatrix

    ## What changes were proposed in this pull request?
    
    -I added the method `toBlockMatrixDense` to the IndexedRowMatrix class. The current implementation of `toBlockMatrix` is insufficient for users with relatively dense IndexedRowMatrix objects, since it assumes sparsity. 
    
    ## How was this patch tested?
    
    I used the same tests already written for `toBlockMatrix()` to test this method. I also added a new additional unit test for an edge case that was not adequately tested by current test suite. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/johnc1231/spark johnc-fix-ir-to-block

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17459
    
----
commit 0614ebc102eaf05769cb10d38c9328be3ce517f8
Author: John Compitello <jo...@broadinstitute.org>
Date:   2017-03-23T17:46:00Z

    Wrote alternative toBlockMatrix method for IndexedRowMatrix

commit c3b9b8ae6bf87c333d64a51cfe049b82534a0d34
Author: John Compitello <jo...@broadinstitute.org>
Date:   2017-03-27T20:58:03Z

    Added to tests, cleaned up code for clarity

commit 25f4989c02bb71ab6ce0714cc9f9c3f448b7baea
Author: John Compitello <jo...@broadinstitute.org>
Date:   2017-03-28T15:14:09Z

    Added toBlockMatrixDense with no arguments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17459


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75575/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #76504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76504/testReport)** for PR 17459 at commit [`994b457`](https://github.com/apache/spark/commit/994b457aaee25763cdfac8edf9e6672da04b392d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111074630
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix mixed backing") {
    +    val mixedData = Seq(
    +      (0L, Vectors.dense(1, 2, 3)),
    +      (3L, Vectors.sparse(3, Seq((0, 4.0)))))
    +
    +    val idxRowMatMixed = new IndexedRowMatrix(
    +      sc.parallelize(mixedData.map(x => IndexedRow(x._1, x._2))))
    +
    +    val blockMat = idxRowMatMixed.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatMixed.toBreeze())
    +
    +    val blocks = blockMat.blocks.collect()
    +
    +    blocks.forall { case((row, col), matrix) =>
    +      if (row == 0) matrix.isInstanceOf[DenseMatrix] else matrix.isInstanceOf[SparseMatrix]}
       }
     
    +
    --- End diff --
    
    Please remove extra space line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r117620801
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    --- End diff --
    
    So it is true that IndexedRowMatrix could have a Long number of rows, but BlockMatrix is backed by an RDD of ((Int, Int), Matrix), so we're limited by that. I can just add a check that computes whether it's possible to make a BlockMatrix from the given IndexedRowMatrix. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75518/testReport)** for PR 17459 at commit [`4582a7e`](https://github.com/apache/spark/commit/4582a7e33836b54315407d7176d1b95c25335791).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r114486021
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,96 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    --- End diff --
    
    nit: styling like suggested above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    I think your prototype looks good. I'm pretty much just gonna do exactly that then. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77606/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r119019226
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue,
    --- End diff --
    
    For cols, we may not need to do this check. Because each `IndexedRow` can only have the number of columns less than `Int.MaxValue`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111074033
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix mixed backing") {
    +    val mixedData = Seq(
    +      (0L, Vectors.dense(1, 2, 3)),
    +      (3L, Vectors.sparse(3, Seq((0, 4.0)))))
    +
    +    val idxRowMatMixed = new IndexedRowMatrix(
    +      sc.parallelize(mixedData.map(x => IndexedRow(x._1, x._2))))
    --- End diff --
    
    To be consistent with above, can we move the call `.map(x => IndexedRow(x._1, x._2))` to `mixedData`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @johnc1231 Thanks for updating this. I'll review it in the weekend.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111072525
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    +            val columnInBlock = index % colsPerBlock
    +            ((blockRow.toInt, blockColumn.toInt), (rowInBlock.toInt, Array((value, columnInBlock))))
    +          }
    +        case DenseVector(values) =>
    +          values.grouped(colsPerBlock)
    +            .zipWithIndex
    +            .map { case (values, blockColumn) =>
    +              ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values.zipWithIndex))
    +            }
    +      }
    +    }.groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map {
    +      case ((blockRow, blockColumn), itr) =>
    +        val actualNumRows =
    +          if (blockRow == lastRowBlockIndex) lastRowBlockSize else rowsPerBlock
    +        val actualNumColumns =
    +          if (blockColumn == lastColBlockIndex) lastColBlockSize else colsPerBlock
    +
    +        val arraySize = actualNumRows * actualNumColumns
    +        val matrixAsArray = new Array[Double](arraySize)
    +        var countForValues = 0
    +        itr.foreach { case (rowWithinBlock, valuesWithColumns) =>
    +          valuesWithColumns.foreach { case (value, columnWithinBlock) =>
    +            matrixAsArray.update(columnWithinBlock * actualNumRows + rowWithinBlock, value)
    +            countForValues += 1
    +          }
    +        }
    +        val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray)
    +        val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) {
    --- End diff --
    
    I am not sure if 0.5 is a proper value practically. Maybe others have more ideas about it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118793614
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock < Int.MaxValue,
    --- End diff --
    
    Good catch, that is true. Will change. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    I've done some prototype locally to generalize this change to `SparseMatrix`. During that, I have a thought that do we have the limit that all Matrix in `BlockMatrix` need to be the same kind of Matrix (i.e., `DenseMatrix` or `SparseMatrix`)?
    
    Actually we can easily have only one `toBlockMatrix` method which creates a `BlockMatrix` including both `DenseMatrix` or `SparseMatrix`, depending if the blocks are sparse or not.
    
    From the external view of this API, we don't have an explicit difference between `SparseMatrix`-backed and `DenseMatrix`-backed `BlockMatrix`s. We don't have subclasses for it, nor any property can be used to know about it. Doesn't it mean we don't really care about it?
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111073831
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    --- End diff --
    
    Also check if `blockMat2` is all sparse matrix backing?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75482/testReport)** for PR 17459 at commit [`12e78bf`](https://github.com/apache/spark/commit/12e78bf9558849be6588360e8dd54ed60f34111f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #76107 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76107/testReport)** for PR 17459 at commit [`d692d30`](https://github.com/apache/spark/commit/d692d3031f9c57ff92b63ccce0962bc899402826).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #77437 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77437/testReport)** for PR 17459 at commit [`a7a03dc`](https://github.com/apache/spark/commit/a7a03dc7627ee10f6eeb879dd1d1ee0c35051c71).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111302391
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    +            val columnInBlock = index % colsPerBlock
    +            ((blockRow.toInt, blockColumn.toInt), (rowInBlock.toInt, Array((value, columnInBlock))))
    +          }
    +        case DenseVector(values) =>
    +          values.grouped(colsPerBlock)
    +            .zipWithIndex
    +            .map { case (values, blockColumn) =>
    +              ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values.zipWithIndex))
    +            }
    +      }
    +    }.groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map {
    +      case ((blockRow, blockColumn), itr) =>
    +        val actualNumRows =
    +          if (blockRow == lastRowBlockIndex) lastRowBlockSize else rowsPerBlock
    +        val actualNumColumns =
    +          if (blockColumn == lastColBlockIndex) lastColBlockSize else colsPerBlock
    +
    +        val arraySize = actualNumRows * actualNumColumns
    +        val matrixAsArray = new Array[Double](arraySize)
    +        var countForValues = 0
    +        itr.foreach { case (rowWithinBlock, valuesWithColumns) =>
    +          valuesWithColumns.foreach { case (value, columnWithinBlock) =>
    +            matrixAsArray.update(columnWithinBlock * actualNumRows + rowWithinBlock, value)
    +            countForValues += 1
    +          }
    +        }
    +        val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray)
    +        val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) {
    --- End diff --
    
    Well, our main operations on block matrix are add and multiply. I guess a reasonable thing to do would be to profile those operations with different fractions besides .5 and see what's most performant 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @srowen Addressed comments, back to you. And thanks for taking time to look this over. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya I made changes exactly as you did in your prototype, plus a few style edits. But yeah, I think this is a good, easy to use implementation that will be better in all use cases than current implementation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Thanks for the feedback guys. All comments addressed, though if anyone else has feedback on discussion me and @viirya are having about whether there should be a separate `toBlockMatrixDense` method or we should just have an argument to specify Dense or Sparse in the default `toBlockMatrix` method, please weigh in. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75754/testReport)** for PR 17459 at commit [`3fe21cf`](https://github.com/apache/spark/commit/3fe21cf2e80a47f6d54578b286d338efb9b7f401).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76107/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109367056
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +113,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    --- End diff --
    
    The style of the comments here and below is not correct. Can you fix it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76504/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #77439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77439/testReport)** for PR 17459 at commit [`289dbdb`](https://github.com/apache/spark/commit/289dbdb8ef4b4c216870e306a729ae5c3d6e4936).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r119018864
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue,
    --- End diff --
    
    +1 We should fix `CoordinateMatrix.toBlockMatrix` too.
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya Do you have any more comments on this, or are you happy with it? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r114486297
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,96 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (7L, Vectors.sparse(6, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    // Gonna make m and n larger here so the matrices can easily be completely sparse:
    +    val m = 8
    +    val n = 6
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(4, 4)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 3)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    +  }
    --- End diff --
    
    nit: styling like above suggested.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Addressed comments where everything was clear, replied to the last one about only having one toBlockMatrix. Back to you @viirya . Thanks for feedback. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109370058
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +113,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +    * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @return a [[BlockMatrix]]
    +    */
    +  def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector.toArray
    +        .grouped(colsPerBlock)
    +        .zipWithIndex
    +        .map({ case (values, blockColumn) =>
    --- End diff --
    
    Style: where you are writing `({ ... })` just write `{ ... }`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109320386
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -89,11 +89,42 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       test("toBlockMatrix") {
         val idxRowMat = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMat.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMat.toBreeze())
    +
    +    intercept[IllegalArgumentException] {
    +      idxRowMat.toBlockMatrix(-1, 2)
    +    }
    +    intercept[IllegalArgumentException] {
    +      idxRowMat.toBlockMatrix(2, 0)
    +    }
    +  }
    +
    +  test("toBlockMatrixDense") {
    --- End diff --
    
    I'm confused, you seem to have commented right on the toBlockMatrixDense tests. Originally, toBlockMatrix had only the tests marked with the comment `// Tests when n % colsPerBlock != 0`. I added the tests marked with `// Tests when m % rowsPerBlock != 0` to toBlockMatrix, then used the same tests for the Dense version. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Also, made changes suggested by @srowen . Don't know if he now has to sign off on those changes being done. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    This has been reviewed pretty thoroughly at this point. Can a committer give this a quick look? @srowen @MLnick @jkbradley I think it's basically ready to go in. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @johnc1231 The prototype I did: https://github.com/apache/spark/compare/master...viirya:general-toblockmatrix?expand=1
    
    Maybe you can take a look and see if it is useful to you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r116543558
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    --- End diff --
    
    I think there's an assumption here that the block index can't be larger than an Int, but it could, right? conceptually the index in an IndexedRow could be huge. Does blockRow need to stay a Long or am I overlooking why it won't happen?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109320642
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +    * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @return a [[BlockMatrix]]
    +    */
    +  def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector.toArray
    +        .grouped(colsPerBlock)
    +        .zipWithIndex
    +        .map({ case (values, blockColumn) =>
    +          ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
    +        })
    +    }).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rowsPerBlock, colsPerBlock)).map({
    --- End diff --
    
    You're right. My code makes the assumption that there is a single block per partition, which is incorrect. Thanks for that. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109359850
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +113,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +    * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @return a [[BlockMatrix]]
    +    */
    --- End diff --
    
    Add a `@Since` annotation like `toBlockMatrix`. Although I doubt this can make in 2.2.0, you can set it to 2.2.0 temporarily. If there is a suggested version from committers, we can change it later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75482/testReport)** for PR 17459 at commit [`12e78bf`](https://github.com/apache/spark/commit/12e78bf9558849be6588360e8dd54ed60f34111f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @srowen @viirya All comments addressed, back to you guys. Hopefully we've just about reached something ready to commit. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Did some thinking about this, and I think that to make the API cleaner maybe we could deprecate the regular `toBlockMatrix` method and add `toBlockMatrixSparse`. Until it's removed, we could just have `toBlockMatrix` call `toBlockMatrixSparse`. I think that'd be more explicit and make it clearer for users what kind of `BlockMatrix` they're creating. After that I think it'd be easy enough for me to abstract out the DenseMatrix creation step of `toBlockMatrixDense` and make it a general purpose helper method that `toBlockMatrixDense` and `toBlockMatrixSparse` can call to create the specific type of `BlockMatrix` they want. 
    
    I think this explicitness is important since it seems a lot of users create a `BlockMatrix` through these methods as opposed to with a `BlockMatrix` constructor since it's kind of a hard constructor to use (official Spark docs also suggest to users that it's easier to use `toBlockMatrix` than to attempt to use constructors: https://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix). 
    
    If we don't want to go the deprecation route, we could have `toBlockMatrix` take an argument specifying whether the data is sparse or dense, but I think that should be an explicitly required argument since it's otherwise easy to create something unintended. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Except for few comments regarding style, the code changes LGTM.
    
    And it'd be good if we can have benchmark for sparse case too.
    
    cc @MLnick @jkbradley for review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @srowen Fixed both, back to you


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109299896
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -98,6 +98,7 @@ class IndexedRowMatrix @Since("1.0.0") (
         toBlockMatrix(1024, 1024)
       }
     
    +
    --- End diff --
    
    Please remove extra line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109817581
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -89,11 +89,19 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       test("toBlockMatrix") {
         val idxRowMat = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    --- End diff --
    
    This only tests a `IndexedRowMatrix` consisted of `DenseVector`. We should add another test to test the cases of `SparseVector` and the mix of them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya Addressed style nitpicks and did spark benchmarks. Think that should be everything. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r110163778
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -89,11 +89,19 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       test("toBlockMatrix") {
         val idxRowMat = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    --- End diff --
    
    I'll write more tests today. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya I fixed the test as you asked, so please take a look when you get a chance. I'm having a little bit of trouble with my local spark build for some reason, but I'll do that other benchmark when it's resolved. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Only one remaining comment left for the test.
    
    Btw, @johnc1231 Do you think it is possible you can run a simple benchmark, so we know whether this change improves the performance too?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111073066
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    --- End diff --
    
    You create indexed rows with duplicate indexes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #77437 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77437/testReport)** for PR 17459 at commit [`a7a03dc`](https://github.com/apache/spark/commit/a7a03dc7627ee10f6eeb879dd1d1ee0c35051c71).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109417523
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    --- End diff --
    
    I agree that there's no clear default, though if DenseMatrix-backed was the default than people would still be able to get SparseMatrix backed by calling `IndexedRowMatrix.toCoordinateMatrix().toBlockMatrix()`, which is the current implementation of toBlockMatrix anyway. But that's less than ideal, as that's not a particularly good implementation either. 
    
    As for generalizing, if we went ahead and just had one toBlockMatrix method with an argument to specify how it should be backed, it'd be easy enough to take the code that turns the iterator of `(Int, Array[Double])` pairs into a `DenseMatrix` and pull it out into a helper method. Then I could make a corresponding one to make a `SparseMatrix` and just call the appropriate one based on the given argument. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109300686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    --- End diff --
    
    Is it a good idea to have both `toBlockMatrix` and `toBlockMatrixDense` for converting to `BlockMatrix` ?
    
    Shall we combine them and have just one `toBlockMatrix` method?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109320701
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -89,11 +89,42 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       test("toBlockMatrix") {
         val idxRowMat = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMat.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMat.toBreeze())
    +
    +    intercept[IllegalArgumentException] {
    +      idxRowMat.toBlockMatrix(-1, 2)
    +    }
    +    intercept[IllegalArgumentException] {
    +      idxRowMat.toBlockMatrix(2, 0)
    +    }
    +  }
    +
    +  test("toBlockMatrixDense") {
    --- End diff --
    
    Oh I see what you mean now, will fix. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #77606 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77606/testReport)** for PR 17459 at commit [`f9c5506`](https://github.com/apache/spark/commit/f9c5506e3165bb851415d5625f085438413c23fe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75518/testReport)** for PR 17459 at commit [`4582a7e`](https://github.com/apache/spark/commit/4582a7e33836b54315407d7176d1b95c25335791).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya Any more feedback on this? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111492414
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,92 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (2L, Vectors.sparse(3, Seq((0, 4.0))))
    --- End diff --
    
    If you look at the constructor for the IndexedRowMatrix, I passed it a width and height so that this would not be an issue. I guess that's not a good idea though, since that invalidates the tests of dimensionality. I'll just make the matrix bigger than 4 x 3 so that it'll be easy to make something sparse. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118792727
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,96 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]
    +    }.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]
    +    }.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (15L, Vectors.sparse(12, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    // Gonna make m and n larger here so the matrices can easily be completely sparse:
    +    val m = 16
    +    val n = 12
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(8, 8)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(6, 6)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    --- End diff --
    
    Isn't this just `blockMat.blocks.forall { case (_, matrix) => matrix.isInstanceOf[SparseMatrix] }`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya Added appropriate tests, back to you. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111073855
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    --- End diff --
    
    Also check if blockMat2 is all sparse matrix backing?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Please modify the PR title also. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya 
    Gonna fix that last test tonight, but just doing a quick timing on my laptop, (2014 Macbook Pro with 2.2Hz i7), the difference was as follows on a 10k by 10k matrix of random doubles between 0 and 1:
    New method: 113837 ms
    Old method: 167853 ms
    
    So it's clearly better in the dense case. Will benchmark sparse tonight as well. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Alright, I agree with this. We'll switch off Dense or Sparse matrix backings based on what the type of the first vector in the iterator is. I'd be happy to take on making these adjustments. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111305748
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,92 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (2L, Vectors.sparse(3, Seq((0, 4.0))))
    --- End diff --
    
    Then the following `Tests when m % rowsPerBlock != 0` is not true anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r116543887
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    +            val columnInBlock = index % colsPerBlock
    +            ((blockRow.toInt, blockColumn.toInt), (rowInBlock.toInt, Array((value, columnInBlock))))
    +          }
    +        case DenseVector(values) =>
    +          values.grouped(colsPerBlock)
    +            .zipWithIndex
    +            .map { case (values, blockColumn) =>
    +              ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values.zipWithIndex))
    +            }
    +      }
    +    }.groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map {
    +      case ((blockRow, blockColumn), itr) =>
    +        val actualNumRows =
    +          if (blockRow == lastRowBlockIndex) lastRowBlockSize else rowsPerBlock
    +        val actualNumColumns =
    +          if (blockColumn == lastColBlockIndex) lastColBlockSize else colsPerBlock
    +
    +        val arraySize = actualNumRows * actualNumColumns
    +        val matrixAsArray = new Array[Double](arraySize)
    +        var countForValues = 0
    +        itr.foreach { case (rowWithinBlock, valuesWithColumns) =>
    +          valuesWithColumns.foreach { case (value, columnWithinBlock) =>
    +            matrixAsArray.update(columnWithinBlock * actualNumRows + rowWithinBlock, value)
    +            countForValues += 1
    +          }
    +        }
    +        val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray)
    +        val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) {
    +          denseMatrix
    +        } else {
    +          denseMatrix.toSparse
    --- End diff --
    
    OK, this isn't inefficient because making the dense matrix doesn't copy or anything. Seems OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r114485962
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,96 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    --- End diff --
    
    nit: the style looks weird.
    
    Maybe:
    
        assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
          matrix.isInstanceOf[DenseMatrix]
        }.reduce(_ && _))


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75475/testReport)** for PR 17459 at commit [`06c2b3a`](https://github.com/apache/spark/commit/06c2b3a22f16052a90f339edfdbe9644fd8ab797).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109370145
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +113,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +    * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @return a [[BlockMatrix]]
    +    */
    +  def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector.toArray
    +        .grouped(colsPerBlock)
    +        .zipWithIndex
    +        .map({ case (values, blockColumn) =>
    +          ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
    +        })
    +    }).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map({
    +      case ((blockRow, blockColumn), itr) =>
    +        val actualNumRows: Int =
    --- End diff --
    
    We usually don't put a type on vals/vars unless it's important for clarity or needed for a cast


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109364067
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    --- End diff --
    
    I am not sure if DenseMatrix-backed is better default behavior for `toBlockMatrix`. Actually the rows in `IndexedRowMatrix` can be sparse or dense. Choose which one, SparkMatrix-backed or DenseMartix-backed, is totally depending on the use case.
    
    Looks like `toBlockMatrixDense` is already a non-small function. Merging it with current `toBlockMatrix` might not a good idea. I'd keep it as it's now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r119463845
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue,
    --- End diff --
    
    Yeah, even with block size one IndexedRows are limited by length of array which is itself limited by max int, so should be fine. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118793567
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,96 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]
    +    }.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]
    +    }.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (15L, Vectors.sparse(12, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    // Gonna make m and n larger here so the matrices can easily be completely sparse:
    +    val m = 16
    +    val n = 12
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(8, 8)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(6, 6)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    --- End diff --
    
    Pretty sure there is no forall on RDD's, which is why I wrote it this way. Could do it as `collect().forall` I suppose. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111074605
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix mixed backing") {
    +    val mixedData = Seq(
    +      (0L, Vectors.dense(1, 2, 3)),
    +      (3L, Vectors.sparse(3, Seq((0, 4.0)))))
    +
    +    val idxRowMatMixed = new IndexedRowMatrix(
    +      sc.parallelize(mixedData.map(x => IndexedRow(x._1, x._2))))
    +
    +    val blockMat = idxRowMatMixed.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatMixed.toBreeze())
    +
    +    val blocks = blockMat.blocks.collect()
    +
    +    blocks.forall { case((row, col), matrix) =>
    --- End diff --
    
    Can you add a comment with a simple diagram to show the content of the matrix? So it can be easy to judge which part is sparse/dense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109817633
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    +            val columnInBlock = index % colsPerBlock
    +            ((blockRow.toInt, blockColumn.toInt), (rowInBlock.toInt, Array((value, columnInBlock))))
    +          }
    +        case DenseVector(values) =>
    +          values.grouped(colsPerBlock)
    +            .zipWithIndex
    +            .map { case (values, blockColumn) =>
    +              ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values.zipWithIndex))
    +            }
    +      }
    +    }.groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map{
    --- End diff --
    
    style: `map{` -> `map {`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    ok to test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Did a sparse benchmark (2014 Macbook Pro with 2.2Hz i7) 60 partitions, 10k by 10k matrix with mostly 0's, 10% 1's, made of SparseVectors. Both old method and new method took about 7.5 seconds. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111074139
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix mixed backing") {
    +    val mixedData = Seq(
    +      (0L, Vectors.dense(1, 2, 3)),
    +      (3L, Vectors.sparse(3, Seq((0, 4.0)))))
    +
    +    val idxRowMatMixed = new IndexedRowMatrix(
    +      sc.parallelize(mixedData.map(x => IndexedRow(x._1, x._2))))
    +
    +    val blockMat = idxRowMatMixed.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatMixed.toBreeze())
    --- End diff --
    
    Can we also add test for `Tests when m % rowsPerBlock != 0`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75575 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75575/testReport)** for PR 17459 at commit [`a38851c`](https://github.com/apache/spark/commit/a38851c2abb438f148df83ed43b029b040373cd8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111305664
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,19 +87,92 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +    assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (2L, Vectors.sparse(3, Seq((0, 4.0))))
    --- End diff --
    
    hmm, I think `2L` makes the matrix as `3x3`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118817991
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue,
    --- End diff --
    
    So I guess the previous `toBlockMatrix` would have failed too when the number of rows exceeded this threshold? it looks like it, given how CoordinateMatrix.toBlockMatrix works. Hm, I wonder if you should even put this warning over there too because it will fail mysteriously otherwise. The result might even be wrong.
    
    BTW on second look, I realize this check isn't quite the same as the math that's performed below: `math.ceil(m.toDouble / rowsPerBlock).toInt`. I think you want to check exactly the same thing. Maybe move the check below the declaration of m, n, and just say: `require(math.ceil(m.toDouble / rowsPerBlock) <= Int.MaxValue)` That's very clear.
    
    Also, cols need to be checked.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75754/testReport)** for PR 17459 at commit [`3fe21cf`](https://github.com/apache/spark/commit/3fe21cf2e80a47f6d54578b286d338efb9b7f401).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #77439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77439/testReport)** for PR 17459 at commit [`289dbdb`](https://github.com/apache/spark/commit/289dbdb8ef4b4c216870e306a729ae5c3d6e4936).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109299891
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +    * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @return a [[BlockMatrix]]
    +    */
    +  def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector.toArray
    +        .grouped(colsPerBlock)
    +        .zipWithIndex
    +        .map({ case (values, blockColumn) =>
    +          ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
    +        })
    +    }).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rowsPerBlock, colsPerBlock)).map({
    --- End diff --
    
    If I don't miss anything, the parameters of `GridPartitioner` are wrong. Should be:
    
        GridPartitioner(numRowBlocks, numColBlocks, rows.partitions.length)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109320468
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    --- End diff --
    
    I have been going back and forth on this myself. I think converting to a BlockMatrix backed by dense matrices is better default behavior than one backed by sparse matrices, but the the current implementation of toBlockMatrix advertises that it converts to a BlockMatrix backed  SparseMatrices, and I thought changing that could negatively affect people who want that behavior. I suppose we could add a default argument to toBlockMatrix like `isSparse = true` so that it would not break anyone's code but people would be able to convert to dense version if they wanted. What do you think of that? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #77606 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77606/testReport)** for PR 17459 at commit [`f9c5506`](https://github.com/apache/spark/commit/f9c5506e3165bb851415d5625f085438413c23fe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75482/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17459

> I considered having toBlockMatrix check if the rows of IndexedRowMatrix were dense or sparse, but there is no guarantee of consistency. Like, an IndexedRowMatrix could be a mix of Dense and Sparse Vectors. In that case, it would not be clear what type of BlockMatrix to create. A decent approximation of this would be to just decide the matrix type based on the first vector we look at in the iterator we get from groupByKey, creating a mix of Dense and Sparse matrices in a BlockMatrix, but I still think it's best to be explicit. Also, we currently have the description of toBlockMatrix promising to make a BlockMatrix backed by instances of SparseMatrix, so we have made promises to users about the composition of the BlockMatrix before.

I don't mean we don't care about it. I meant there is no guarantee that `BlockMatrix` is purely consisted of `DenseMatrix` or `SparseMatrix`. It could be a mix of them.

Thus, we can have a `toBlockMatrix` which creates a `BlockMatrix` which is a mix of `DenseMatrix` and `SparseMatrix`. A block in a `BlockMatrix` can be a `DenseMatrix` and `SparseMatrix`, depending on the ratio of values in the block. Yes, it is like `a decent approximation` you talked.

For a `IndexedRowMatrix` completely consisted of `DenseVector`, this `toBlockMatrix` definitely returns a `BlockMatrix` backed by `DenseMatrix`. For other cases, `DenseMatrix` might not be best choice for all blocks in the `BlockMatrix`, as many blocks will be sparse.

About the promise that `toBlockMatrix` makes a `BlockMatrix` backed by instances of `SparseMatrix`, as I said it is not explicitly bound to the API level. I think it is not a big problem.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    oh, seems only the committers can trigger jenkins test. cc @jkbradley @MLnick 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r117621336
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    --- End diff --
    
    Good point. Replaced word "last" with "remainder" and added a small clarifying comment. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75754/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r111302756
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -87,21 +87,74 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(coordMat.toBreeze() === idxRowMat.toBreeze())
       }
     
    -  test("toBlockMatrix") {
    -    val idxRowMat = new IndexedRowMatrix(indexedRows)
    -    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +  test("toBlockMatrix dense backing") {
    +    val idxRowMatDense = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatDense.toBlockMatrix(2, 2)
         assert(blockMat.numRows() === m)
         assert(blockMat.numCols() === n)
    -    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +    assert(blockMat.toBreeze() === idxRowMatDense.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatDense.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatDense.toBreeze())
     
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(-1, 2)
    +      idxRowMatDense.toBlockMatrix(-1, 2)
         }
         intercept[IllegalArgumentException] {
    -      idxRowMat.toBlockMatrix(2, 0)
    +      idxRowMatDense.toBlockMatrix(2, 0)
         }
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
    +  }
    +
    +  test("toBlockMatrix sparse backing") {
    +    val sparseData = Seq(
    +      (3L, Vectors.sparse(3, Seq((0, 4.0))))
    +    ).map(x => IndexedRow(x._1, x._2))
    +
    +    val idxRowMatSparse = new IndexedRowMatrix(sc.parallelize(sparseData))
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMatSparse.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMatSparse.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())
    +
    +    assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
    +      matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
    --- End diff --
    
    This is done, but comment is not showing as outdated. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #76504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76504/testReport)** for PR 17459 at commit [`994b457`](https://github.com/apache/spark/commit/994b457aaee25763cdfac8edf9e6672da04b392d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75518/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    @viirya Now that style nitpicks and sparse benchmarks are done, are you good with this? Also, per your recommendation, CCing @MLnick and @jkbradley for review of this. Should be easy to review, since we've iterated on it a lot. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r116542082
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    +            val columnInBlock = index % colsPerBlock
    +            ((blockRow.toInt, blockColumn.toInt), (rowInBlock.toInt, Array((value, columnInBlock))))
    +          }
    +        case DenseVector(values) =>
    +          values.grouped(colsPerBlock)
    +            .zipWithIndex
    +            .map { case (values, blockColumn) =>
    +              ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values.zipWithIndex))
    +            }
    +      }
    +    }.groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map {
    +      case ((blockRow, blockColumn), itr) =>
    +        val actualNumRows =
    +          if (blockRow == lastRowBlockIndex) lastRowBlockSize else rowsPerBlock
    +        val actualNumColumns =
    +          if (blockColumn == lastColBlockIndex) lastColBlockSize else colsPerBlock
    +
    +        val arraySize = actualNumRows * actualNumColumns
    +        val matrixAsArray = new Array[Double](arraySize)
    +        var countForValues = 0
    +        itr.foreach { case (rowWithinBlock, valuesWithColumns) =>
    +          valuesWithColumns.foreach { case (value, columnWithinBlock) =>
    +            matrixAsArray.update(columnWithinBlock * actualNumRows + rowWithinBlock, value)
    +            countForValues += 1
    +          }
    +        }
    +        val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray)
    +        val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) {
    --- End diff --
    
    `BlockMatrix` seems to use sparse representations when <= 10% of values are non-zero when converting to an indexed row matrix. Maybe go with that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109300484
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrixSuite.scala ---
    @@ -89,11 +89,42 @@ class IndexedRowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       test("toBlockMatrix") {
         val idxRowMat = new IndexedRowMatrix(indexedRows)
    +
    +    // Tests when n % colsPerBlock != 0
    +    val blockMat = idxRowMat.toBlockMatrix(2, 2)
    +    assert(blockMat.numRows() === m)
    +    assert(blockMat.numCols() === n)
    +    assert(blockMat.toBreeze() === idxRowMat.toBreeze())
    +
    +    // Tests when m % rowsPerBlock != 0
    +    val blockMat2 = idxRowMat.toBlockMatrix(3, 1)
    +    assert(blockMat2.numRows() === m)
    +    assert(blockMat2.numCols() === n)
    +    assert(blockMat2.toBreeze() === idxRowMat.toBreeze())
    +
    +    intercept[IllegalArgumentException] {
    +      idxRowMat.toBlockMatrix(-1, 2)
    +    }
    +    intercept[IllegalArgumentException] {
    +      idxRowMat.toBlockMatrix(2, 0)
    +    }
    +  }
    +
    +  test("toBlockMatrixDense") {
    --- End diff --
    
    I don't see you test newly added `toBlockMatrixDense`, do you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on the issue:

https://github.com/apache/spark/pull/17459

@viirya I think we definitely care about giving users the ability to make either dense or sparse Block matrices. I made a 100k by 10k IndexedRowMatrix of random doubles, then converted it to a BlockMatrix to multiply it by its transpose. With the current toBlockMatrix implementation, that took 252 seconds on 128 cores. With my implementation, that took 35 seconds. The backing of a BlockMatrix matters a lot, and we need to let users be explicit about it.

I considered having toBlockMatrix check if the rows of IndexedRowMatrix were dense or sparse, but there is no guarantee of consistency. Like, an IndexedRowMatrix could be a mix of Dense and Sparse Vectors. In that case, it would not be clear what type of BlockMatrix to create. A decent approximation of this would be to just decide the matrix type based on the first vector we look at in the iterator we get from groupByKey, but I still think it's best to be explicit.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109382284
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    --- End diff --
    
    Actually I think we can generalize the change to SparseMatrix-based `BlockMatrix ` too. But maybe we can do it in following PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75475/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118818160
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue,
    +      "Number of rows divided by rowsPerBlock cannot exceed maximum integer.")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    // The remainder calculations only matter when m % rowsPerBlock != 0 or n % colsPerBlock != 0
    +    val remainderRowBlockIndex = m / rowsPerBlock
    +    val remainderColBlockIndex = n / colsPerBlock
    +    val remainderRowBlockSize = (m % rowsPerBlock).toInt
    +    val remainderColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks = rows.flatMap { ir: IndexedRow =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector match {
    +        case SparseVector(size, indices, values) =>
    +          indices.zip(values).map { case (index, value) =>
    +            val blockColumn = index / colsPerBlock
    +            val columnInBlock = index % colsPerBlock
    +            ((blockRow.toInt, blockColumn.toInt), (rowInBlock.toInt, Array((value, columnInBlock))))
    +          }
    +        case DenseVector(values) =>
    +          values.grouped(colsPerBlock)
    +            .zipWithIndex
    +            .map { case (values, blockColumn) =>
    +              ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values.zipWithIndex))
    +            }
    +      }
    +    }.groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map {
    +      case ((blockRow, blockColumn), itr) =>
    +        val actualNumRows =
    +          if (blockRow == remainderRowBlockIndex) remainderRowBlockSize else rowsPerBlock
    +        val actualNumColumns =
    +          if (blockColumn == remainderColBlockIndex) remainderColBlockSize else colsPerBlock
    +
    +        val arraySize = actualNumRows * actualNumColumns
    +        val matrixAsArray = new Array[Double](arraySize)
    +        var countForValues = 0
    +        itr.foreach { case (rowWithinBlock, valuesWithColumns) =>
    +          valuesWithColumns.foreach { case (value, columnWithinBlock) =>
    +            matrixAsArray.update(columnWithinBlock * actualNumRows + rowWithinBlock, value)
    +            countForValues += 1
    +          }
    +        }
    +        val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray)
    +        val finalMatrix = if (countForValues / arraySize.toDouble >= 0.1) {
    +          denseMatrix
    +        } else {
    +          denseMatrix.toSparse
    +        }
    +
    +        ((blockRow, blockColumn), finalMatrix)
    +    }
    +    new BlockMatrix(blocks, rowsPerBlock, colsPerBlock, this.numRows(), this.numCols())
    --- End diff --
    
    Nit: can the last two args simply be m, n for clarity?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r116542816
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    --- End diff --
    
    So this is the index of the final, smaller block, if any? I get it, but if m = 100 and n = 10 then this is 10, which is not the index of the last row block. There is no leftover smaller block and the last one is 9. I  think the code works and I'm splitting hairs but wonder if this is clearer if it's the "remainder" block index or something?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #76107 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76107/testReport)** for PR 17459 at commit [`d692d30`](https://github.com/apache/spark/commit/d692d3031f9c57ff92b63ccce0962bc899402826).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77437/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by johnc1231 <gi...@git.apache.org>.

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118793854
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock < Int.MaxValue,
    --- End diff --
    
    Well, it's true if I do floating point division. It's not necessarily true if it's long / int division. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75575 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75575/testReport)** for PR 17459 at commit [`a38851c`](https://github.com/apache/spark/commit/a38851c2abb438f148df83ed43b029b040373cd8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r118792582
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -108,8 +108,69 @@ class IndexedRowMatrix @Since("1.0.0") (
        */
       @Since("1.3.0")
       def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    -    // TODO: This implementation may be optimized
    -    toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    // Since block matrices require an integer row index
    +    require(numRows() / rowsPerBlock < Int.MaxValue,
    --- End diff --
    
    Nit: isn't `<=` OK too? very much a corner case, but hey


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to Indexed...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    **[Test build #75475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75475/testReport)** for PR 17459 at commit [`06c2b3a`](https://github.com/apache/spark/commit/06c2b3a22f16052a90f339edfdbe9644fd8ab797).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method on Ind...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17459
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77439/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org