You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by rahulpalamuttam <gi...@git.apache.org> on 2015/10/02 03:24:35 UTC

[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

GitHub user rahulpalamuttam opened a pull request:

    https://github.com/apache/spark/pull/8960

    [SPARK-10906][MLlib] More efficient SparseMatrix.equals

    @jkbradley
    Calls toBreeze.activeIterator which, for CSCMatrix, returns an iterator
    with only non zero values. The iterators are then converted to sets and
    checked for equality. Since the input parameter can be of type DenseMatrix
    the second set is filtered for non-zero values. This is because activeIterator
    for DenseMatrix returns all values. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rahulpalamuttam/spark RahulP_SparseMatrixEquals

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8960
    
----
commit f3f7e650455600c0d266395ac549090b959ada95
Author: rahul <ra...@gmail.com>
Date:   2015-10-02T01:19:04Z

    updated SparseMatrix equal method

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-168816040
  
    @jkbradley 
    No worries. I'll submit a PR to Breeze with the changes.
    I've highlighted the function to be changed in the Breeze Matrix class in the following link.
    
    https://github.com/scalanlp/breeze/blob/3f978aa177ebc11a881379a97d0618400e0b1b49/math/src/main/scala/breeze/linalg/Matrix.scala#L134-L139.
    
    Do you want me to update the PR with the test code and remove the implementation code?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8960


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-168925236
  
    Also on a related point to what I had here: using activeIte SparseMatrix equality only works when both matrices are Sparse. Breeze does not provide a numNonZero method and using it defeats the purpose of an "optimal" SparseMatrix equality. This is because it iterates through 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-168942081
  
    @jkbradley @mengxr 
    I have opened the new PR in Breeze here :
    https://github.com/scalanlp/breeze/pull/480



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-148956288
  
    @jkbradley 
    I have merged you're PR to my branch.
    Both of our tests for explicit zeros pass. (I have included both).
    Also I have taken out the unnecessary test involving the 10x10 identity matrix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
GitHub user rahulpalamuttam reopened a pull request:

    https://github.com/apache/spark/pull/8960

    [SPARK-10906][MLlib] More efficient SparseMatrix.equals

    @jkbradley
    Calls toBreeze.activeIterator which, for CSCMatrix, returns an iterator
    with only non zero values. The iterators are then converted to sets and
    checked for equality. Since the input parameter can be of type DenseMatrix
    the second set is filtered for non-zero values. This is because activeIterator
    for DenseMatrix returns all values. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rahulpalamuttam/spark RahulP_SparseMatrixEquals

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8960
    
----
commit f3f7e650455600c0d266395ac549090b959ada95
Author: rahul <ra...@gmail.com>
Date:   2015-10-02T01:19:04Z

    updated SparseMatrix equal method

commit f9c6ac0795919afedc85d66f246abf933dc0d092
Author: rahul <ra...@gmail.com>
Date:   2015-10-09T09:28:24Z

    updated SparseMatrix equal method to use iterator traversal.

commit 70016e465c68df1c566e2f5eb44ee19571321312
Author: rahul <ra...@gmail.com>
Date:   2015-10-10T00:33:19Z

    Updated tests to include 3x3 matrix with more zero elements.

commit 2fff7d74f95242f208874d4f5059e5d86198b8a5
Author: rahul <ra...@gmail.com>
Date:   2015-10-10T00:38:06Z

    Added tests with identity matrices

commit 7f3f888ff8eb02cdfcbf0f9cad359173154df413
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2015-10-10T02:05:34Z

    unit test which fails in MatricesSuite equals

commit b3c29cab43a21dd9626118ca16d0bd7ba2d2ddd7
Author: rahul <ra...@gmail.com>
Date:   2015-10-17T00:44:53Z

    Updated to handle explicit and implicit zero comparison. Added corresponding tests.

commit e8821627976936528e55cacf93885ab7ef1ca1cd
Author: rahul <ra...@gmail.com>
Date:   2015-10-17T22:27:40Z

    Merge branch 'rahulpalamuttam-RahulP_SparseMatrixEquals' of https://github.com/jkbradley/spark into jkbradley-rahulpalamuttam-RahulP_SparseMatrixEquals
    
    Conflicts:
    	mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-149738717
  
    @rahulpalamuttam May I ask your use case for `Matrix.equals`? It is not common to test strict equality between two double matrices. Checking the norm on their difference is more proper. The only issue we have observed is in Python serialization. But I'm not sure whether this applies to you.
    
    Btw, a better solution would be updating the implementation in breeze instead of Spark, which benefits both projects.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-167914321
  
    @rahulpalamuttam Sorry for the delay in reviewing!  How would you feel about updating the implementation in Breeze, rather than in Spark?  I expect you could use much of the code you've already written, and I'd be happy to help review your PR to Breeze if it's helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-169156603
  
    Right, sorry, I was losing track of things.  In that case, let's close this issue and focus on improving the Breeze method.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-174657229
  
    I closed this PR based on the discussion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-144895074
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8960#discussion_r41596881
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
    @@ -550,7 +550,10 @@ class SparseMatrix @Since("1.3.0") (
           values: Array[Double]) = this(numRows, numCols, colPtrs, rowIndices, values, false)
     
       override def equals(o: Any): Boolean = o match {
    -    case m: Matrix => toBreeze == m.toBreeze
    +    case m: Matrix =>
    +      val thisIteratorSet = toBreeze.activeIterator.toSet
    +      val mIteratorSet = m.toBreeze.activeIterator.toSet.filter(p => p._2 != 0.0)
    +      thisIteratorSet == mIteratorSet
    --- End diff --
    
    Rather than converting to sets (fairly expensive), how about comparing the iterators directly?  It's a bit more complex (handling zeros) but would then be linear time, with a small constant.  It would also allow early stopping as soon as a non-matching value was found.
    
    Also, this method should compare the matrix dimensions before looking at the values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-146941527
  
    You're right about them not using the same order in activeIterator; that's annoying.  I like the tests you added before using the iterators.  But I don't think the use of the iterators will handle zero values properly.  Can you go ahead and add one or more unit tests, particularly where there are both implicit and explicit zeros, in order to catch such issues?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-169136389
  
    @rahulpalamuttam Yes, please, it'd be great to update this PR to be a test only (assuming it fails without the Breeze fix?).  I'll test and comment on the Breeze PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-149656950
  
    Is there any more edge conditions left to consider in this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-148858254
  
    Ping!  Will you have a chance to update this soon?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam closed the pull request at:

    https://github.com/apache/spark/pull/8960


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by rahulpalamuttam <gi...@git.apache.org>.
Github user rahulpalamuttam commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-169139021
  
    Actually the test passes regardless of whether the fix is there or not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10906][MLlib] More efficient SparseMatr...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8960#issuecomment-147026311
  
    In a sparse representation, an implicit zero is not in values or indices arrays, but an explicit zero is (with value 0).  I'm sending your PR a PR to demonstrate the problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org