Posted to reviews@spark.apache.org by da-steve101 <gi...@git.apache.org> on 2015/10/08 04:13:06 UTC

[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

GitHub user da-steve101 opened a pull request:

    https://github.com/apache/spark/pull/9020

    [SPARK-10989] [MLLIB] Added the dot and hadamard operators to mllib

    Added a basic implementation of dot and hadamard products between vectors for mllib

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/da-steve101/spark dotandhadamard

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9020.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9020
    
----
commit 9565f52dfb34f23376846ef157cf195debb2f43e
Author: Stephen Tridgell <st...@intel.com>
Date:   2015-10-08T01:41:41Z

    [SPARK-10989] [MLLIB] Added the dot and hadamard operators to Vectors.scala in mllib
    
    Added a basic implementation of dot and hadamard products between vectors

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9020#discussion_r41539167
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
    @@ -512,6 +513,92 @@ object Vectors {
         squaredDistance
       }
     
    +  private def dot(a : DenseVector, b : DenseVector) : Double = {
    +    (a.toArray zip b.toArray).map(x => (x._1 * x._2)).sum
    +  }
    +
    +  private def dot(a : SparseVector, b : DenseVector) : Double = {
    +    (a.indices zip a.values).map(x => { b(x._1)*x._2 }).sum
    --- End diff --
    
    What I mean is that you can already call Breeze for this without any code change in Spark, using `toBreeze`. It's easy, though I get that it sometimes involves a copy. Is there a compelling case that this is needed often enough, and needs to be fast enough in Spark, that it should be added to `Vector`? I ask only because I don't see anything changed to call it.




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9020#discussion_r41608745
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
    @@ -512,6 +513,92 @@ object Vectors {
         squaredDistance
       }
     
    +  private def dot(a : DenseVector, b : DenseVector) : Double = {
    +    (a.toArray zip b.toArray).map(x => (x._1 * x._2)).sum
    +  }
    +
    +  private def dot(a : SparseVector, b : DenseVector) : Double = {
    +    (a.indices zip a.values).map(x => { b(x._1)*x._2 }).sum
    --- End diff --
    
    Ah... I keep forgetting that part, of course. I think that `Vector` is really an API for Spark itself, and Spark doesn't have this problem to solve since it can access Breeze, etc. However, I can see that `Vector` is something user code reasonably uses and manipulates. While it may be best for apps to convert these into their own preferred representation for any serious manipulation, there's an argument that providing more than the bare essentials in an API is worth the effort. I suppose that's the question in SPARK-6442. Dot product seems like a legitimate addition; Hadamard, not so sure.




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by da-steve101 <gi...@git.apache.org>.
Github user da-steve101 commented on the pull request:

    https://github.com/apache/spark/pull/9020#issuecomment-146743352
  
    I just saw the other discussion https://issues.apache.org/jira/browse/SPARK-6442
    I guess perhaps this should be put on hold?




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9020#issuecomment-146395870
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by da-steve101 <gi...@git.apache.org>.
Github user da-steve101 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9020#discussion_r41537324
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
    @@ -512,6 +513,92 @@ object Vectors {
         squaredDistance
       }
     
    +  private def dot(a : DenseVector, b : DenseVector) : Double = {
    +    (a.toArray zip b.toArray).map(x => (x._1 * x._2)).sum
    +  }
    +
    +  private def dot(a : SparseVector, b : DenseVector) : Double = {
    +    (a.indices zip a.values).map(x => { b(x._1)*x._2 }).sum
    --- End diff --
    
    Fair enough, changed to Breeze.
    I was using Hadamard for Kij * (yi*yj) in an SVM implementation, and it was more convenient to use an element-wise vector product when storing the matrix as a row matrix.
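
As a rough illustration of the use case described in this comment, the sketch below computes one row of a matrix Q with Q_ij = K_ij * (y_i * y_j) using an element-wise product. It uses plain Scala arrays rather than Spark or Breeze types, and the names `HadamardSketch`, `hadamard` and `qRow` are hypothetical, chosen only for this example; they are not taken from the pull request.

```scala
object HadamardSketch {
  /** Element-wise (Hadamard) product of two equal-length dense vectors. */
  def hadamard(a: Array[Double], b: Array[Double]): Array[Double] = {
    require(a.length == b.length, "vectors must have the same length")
    val out = new Array[Double](a.length)
    var j = 0
    while (j < a.length) {
      out(j) = a(j) * b(j)
      j += 1
    }
    out
  }

  /** Row i of Q, where Q_ij = K_ij * (y_i * y_j).
    * kRow is row i of the kernel matrix, y holds all labels, yi is label i. */
  def qRow(kRow: Array[Double], y: Array[Double], yi: Double): Array[Double] =
    hadamard(kRow, y.map(_ * yi))
}
```

With rows stored one vector at a time, each row of Q can be produced independently, which is presumably why an element-wise vector operation is convenient for a row matrix.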




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by da-steve101 <gi...@git.apache.org>.
Github user da-steve101 closed the pull request at:

    https://github.com/apache/spark/pull/9020




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by da-steve101 <gi...@git.apache.org>.
Github user da-steve101 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9020#discussion_r41593650
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
    @@ -512,6 +513,92 @@ object Vectors {
         squaredDistance
       }
     
    +  private def dot(a : DenseVector, b : DenseVector) : Double = {
    +    (a.toArray zip b.toArray).map(x => (x._1 * x._2)).sum
    +  }
    +
    +  private def dot(a : SparseVector, b : DenseVector) : Double = {
    +    (a.indices zip a.values).map(x => { b(x._1)*x._2 }).sum
    --- End diff --
    
    The problem is that `toBreeze` is private (I don't think it should be, but I felt I would be overstepping to change that). It's also a bit annoying from a programmer's perspective. I can see what you are saying; I just think that extra step should be inside rather than outside. Really, I think all the Breeze operations should be defined for these vectors (or just use Breeze vectors, but I guess a single vector may need to be distributed).
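
One way user code can hide such an "extra step" today, without any change to Spark, is a small enrichment class. The sketch below is only an assumption about how that might look: it works on plain `Array[Double]` values (for example, the result of a vector's `toArray`), and `VectorSyntax` and `DenseOps` are hypothetical names that exist in neither Spark nor Breeze.

```scala
object VectorSyntax {
  // Hypothetical enrichment adding dot and hadamard as methods on dense arrays.
  implicit class DenseOps(a: Array[Double]) {
    /** Dot product with another dense vector of the same length. */
    def dot(b: Array[Double]): Double = {
      require(a.length == b.length, "vectors must have the same length")
      var sum = 0.0
      var i = 0
      while (i < a.length) {
        sum += a(i) * b(i)
        i += 1
      }
      sum
    }

    /** Element-wise product with another dense vector of the same length. */
    def hadamard(b: Array[Double]): Array[Double] = {
      require(a.length == b.length, "vectors must have the same length")
      Array.tabulate(a.length)(i => a(i) * b(i))
    }
  }
}
```

After `import VectorSyntax._`, a call such as `Array(1.0, 2.0, 3.0) dot Array(4.0, 5.0, 6.0)` reads like a built-in operator. The trade-off is that every application carries its own copy of such helpers, which is the inconvenience the comment points at.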




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/9020#issuecomment-169112070
  
    @da-steve101 I do think this should be put on hold, but I hope we can revisit SPARK-6442 soon.  Could you please keep your branch, but close this issue for now?  Thank you!




[GitHub] spark pull request: [SPARK-10989] [MLLIB] Added the dot and hadama...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9020#discussion_r41490463
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
    @@ -512,6 +513,92 @@ object Vectors {
         squaredDistance
       }
     
    +  private def dot(a : DenseVector, b : DenseVector) : Double = {
    +    (a.toArray zip b.toArray).map(x => (x._1 * x._2)).sum
    +  }
    +
    +  private def dot(a : SparseVector, b : DenseVector) : Double = {
    +    (a.indices zip a.values).map(x => { b(x._1)*x._2 }).sum
    --- End diff --
    
    This probably won't work as it makes the sparse vector dense. In general I think these are slower than the Breeze implementation. I also question the value of the Hadamard product, as I've never actually had it come up.
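
To make the performance concern concrete, here is a minimal sketch of dot products that touch only the stored entries of a sparse vector and avoid the intermediate tuples that `zip` and `map` allocate. It is not the code from the pull request and not Breeze's implementation. It assumes a sparse vector is represented by parallel `indices` and `values` arrays with indices sorted in ascending order, and the names `SparseDot`, `sparseDenseDot` and `sparseSparseDot` are hypothetical.

```scala
object SparseDot {
  /** Sparse . dense: visits only the stored entries, so nothing is densified. */
  def sparseDenseDot(indices: Array[Int], values: Array[Double],
                     dense: Array[Double]): Double = {
    require(indices.length == values.length, "indices and values must align")
    var sum = 0.0
    var k = 0
    while (k < indices.length) { // plain loop: no tuple or array allocation
      sum += values(k) * dense(indices(k))
      k += 1
    }
    sum
  }

  /** Sparse . sparse: merge over two ascending index arrays. */
  def sparseSparseDot(ia: Array[Int], va: Array[Double],
                      ib: Array[Int], vb: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    var j = 0
    while (i < ia.length && j < ib.length) {
      if (ia(i) == ib(j)) {
        sum += va(i) * vb(j) // index stored on both sides
        i += 1
        j += 1
      } else if (ia(i) < ib(j)) {
        i += 1 // index only in a: contributes zero
      } else {
        j += 1 // index only in b: contributes zero
      }
    }
    sum
  }
}
```

Both functions run in time proportional to the number of stored entries rather than the full vector length, which matters when vectors are very sparse.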

