You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by uzadude <gi...@git.apache.org> on 2016/07/06 10:39:24 UTC

[GitHub] spark pull request #14068: enhanced simulate multiply

GitHub user uzadude opened a pull request:

    https://github.com/apache/spark/pull/14068

    enhanced simulate multiply

    ## What changes were proposed in this pull request?
    
    We have a use case of multiplying very big sparse matrices. we have about 1000x1000 distributed block matrices multiplication and the simulate multiply goes like O(n^4) (n being 1000). it takes about 1.5 hours. We modified it slightly with classical hashmap and now run in about 30 seconds O(n^2).
    
    
    ## How was this patch tested?
    
    We have added a performance test and verified the reduced time.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uzadude/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14068
    
----
commit 0126ea142ef3a842f00b82920483285c28ab7666
Author: oraviv <or...@paypal.com>
Date:   2016-07-06T06:35:58Z

    enhanced simulate multiply

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14068: enhanced simulate multiply

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14068#discussion_r70059298
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
    @@ -426,16 +426,23 @@ class BlockMatrix @Since("1.3.0") (
           partitioner: GridPartitioner): (BlockDestinations, BlockDestinations) = {
         val leftMatrix = blockInfo.keys.collect() // blockInfo should already be cached
         val rightMatrix = other.blocks.keys.collect()
    +
    +    val rightCounterpartsHelper = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
    --- End diff --
    
    Nit: could this just be `rightMatrix.groupBy(_._1).mapValues(_.map(_._2))`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: enhanced simulate multiply

Posted by uzadude <gi...@git.apache.org>.
Github user uzadude commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Hi srowen,
    I have read the "how to contribute" wiki. I thought that it is too small of enhancement to open a jira for it and it passes the tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by akaltsikis <gi...@git.apache.org>.
Github user akaltsikis commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    @uzadude Looking forward to make it work for matrixes bigger than 1Mx1M and sparsity of 0.001-0.002.
    It means that i need to use distributed linear algebra methods. I am using a CoordinateMatrix format for my matrices right now but it seems that it can not multiply the matrix by himself (square it ) fast enough. I will be happy if you can send me an email with the same username that i have at gmail dot com.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    **[Test build #62234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62234/consoleFull)** for PR 14068 at commit [`3ae9ac7`](https://github.com/apache/spark/commit/3ae9ac74418b149d4dd82bf9c08199064874c807).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    LGTM but CC @brkyvz 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    **[Test build #62234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62234/consoleFull)** for PR 14068 at commit [`3ae9ac7`](https://github.com/apache/spark/commit/3ae9ac74418b149d4dd82bf9c08199064874c807).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by uzadude <gi...@git.apache.org>.
Github user uzadude commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Have done the requested nit changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by akaltsikis <gi...@git.apache.org>.
Github user akaltsikis commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    @uzadude hey, i am looking to multiply VERY LARGE AND VERY SPARSE matrixes using Spark. I would love some discussion over it. Can you give me a way to contact you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14068: [SPARK-16469] enhanced simulate multiply

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14068#discussion_r70181303
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
    @@ -426,16 +426,19 @@ class BlockMatrix @Since("1.3.0") (
           partitioner: GridPartitioner): (BlockDestinations, BlockDestinations) = {
         val leftMatrix = blockInfo.keys.collect() // blockInfo should already be cached
         val rightMatrix = other.blocks.keys.collect()
    +
    +    val rightCounterpartsHelper = rightMatrix.groupBy(_._1).mapValues(_.map(_._2))
         val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
    -      val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
    -      val partitions = rightCounterparts.map(b => partitioner.getPartition((rowIndex, b._2)))
    -      ((rowIndex, colIndex), partitions.toSet)
    +      ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
    +        partitioner.getPartition((rowIndex, b))).toSet)
         }.toMap
    +
    +    val leftCounterpartsHelper = leftMatrix.groupBy(_._2).mapValues(_.map(_._1))
         val rightDestinations = rightMatrix.map { case (rowIndex, colIndex) =>
    -      val leftCounterparts = leftMatrix.filter(_._2 == rowIndex)
    -      val partitions = leftCounterparts.map(b => partitioner.getPartition((b._1, colIndex)))
    -      ((rowIndex, colIndex), partitions.toSet)
    +      ((rowIndex, colIndex), leftCounterpartsHelper.getOrElse(rowIndex, Array()).map(b =>
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14068: [SPARK-16469] enhanced simulate multiply

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14068#discussion_r70181291
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
    @@ -426,16 +426,19 @@ class BlockMatrix @Since("1.3.0") (
           partitioner: GridPartitioner): (BlockDestinations, BlockDestinations) = {
         val leftMatrix = blockInfo.keys.collect() // blockInfo should already be cached
         val rightMatrix = other.blocks.keys.collect()
    +
    +    val rightCounterpartsHelper = rightMatrix.groupBy(_._1).mapValues(_.map(_._2))
         val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
    -      val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
    -      val partitions = rightCounterparts.map(b => partitioner.getPartition((rowIndex, b._2)))
    -      ((rowIndex, colIndex), partitions.toSet)
    +      ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
    --- End diff --
    
    nit: for readability could you assign this to a variable instead of inlining it?
    In addition, for multi-line expressions
    ```scala
    .map { b =>
      blah
    }
    ```
    is more preferred


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: enhanced simulate multiply

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    It does make sense. Please make a JIRA and connect this though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: enhanced simulate multiply

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    I don't think this is trivial. You also need to explain the change in more detail.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by uzadude <gi...@git.apache.org>.
Github user uzadude commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Sure, what size are you talking about? we had to some internal code fixes to do that. right now the support for sparse matrices is pretty poor - mainly because Breeze doesn't support it.
    I'm not sure how can I privately send you my contact details here. could you send me a mean of communication and I'll contact you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    LGTM, just a minor nit! Thanks for this PR


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62234/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: enhanced simulate multiply

Posted by uzadude <gi...@git.apache.org>.
Github user uzadude commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Sure.
    The current method for multiplying distributed block matrices starts by deciding which block should be shuffled to which partition to do the actual multiplications. This stage is implemented in the function called "simulateMultiply" in BlockMatrix class.
    Specifically, it iterates over all the blocks in the left matrix (lets assume there are NxN blocks) and for each block iterates over all the blocks in the right matrix (also NxN) to check for potential matches. This process if obviously in the order of O(N^4). The nature of the inner iteration could be enhanced trivially. It currently filters on pre-determend condition:
    ```scala
     val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
          val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
          val partitions = rightCounterparts.map(b => partitioner.getPartition((rowIndex, b._2)))
          ((rowIndex, colIndex), partitions.toSet)
        }.toMap
    ```
    more clearly this part:
    ```scala
          val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
    ```
    So if we were to cache this check for example in a HashMap:
    ```scala
        val rightCounterpartsHelper = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
          (rowIndex, arr.map(b => b._2))
        }
    ```
    We can omit the inner filter and just use it:
    ```scala
        val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
          ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
            partitioner.getPartition((rowIndex, b))).toSet)
        }.toMap
    ```
    And to put it al toghether:
    ```scala
        val rightCounterpartsHelper = rightMatrix.groupBy(_._1).map { case (rowIndex, arr) =>
          (rowIndex, arr.map(b => b._2))
        }
        val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
          ((rowIndex, colIndex), rightCounterpartsHelper.getOrElse(colIndex, Array()).map(b =>
            partitioner.getPartition((rowIndex, b))).toSet)
        }.toMap
    ```
    And the same trick also for the rightDestinations.
    As I mentioned above we encountered this while trying to multiply big sparse matrices and it got stuck on the driver for a very long time (~1.5 hours) and we had to do it iteratively so the all process took many hours. After the fix this part reduced to a few seconds.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: enhanced simulate multiply

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14068: [SPARK-16469] enhanced simulate multiply

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14068


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: [SPARK-16469] enhanced simulate multiply

Posted by uzadude <gi...@git.apache.org>.
Github user uzadude commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    I have opened SPARK-16469.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14068: enhanced simulate multiply

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14068
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org