Posted to reviews@spark.apache.org by benradford <gi...@git.apache.org> on 2017/03/10 05:29:51 UTC

[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

GitHub user benradford opened a pull request:

    https://github.com/apache/spark/pull/17234

    [SPARK-19892][MLlib] Implement findAnalogies method for Word2VecModel

    ## What changes were proposed in this pull request?
    
    Added findAnalogies method to Word2VecModel for performing vector-algebra-based queries (e.g. King + Woman - Man).
    
    ## How was this patch tested?
    
    Followed the contributor's guide for Spark and ran the run-tests. Compiled and tested functionality in spark-shell.
    
    This is an original work that I license to the project under the project's open source license.
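
The kind of vector-algebra query named in the description (King + Woman - Man) can be illustrated with a minimal, Spark-free sketch. Everything here is an assumption made for illustration: the `AnalogySketch` object, the toy two-dimensional embeddings, and this simplified `findAnalogies` signature are hypothetical and are not the code from the patch. Unlike the patch, the sketch skips per-vector normalization before summation.

```scala
object AnalogySketch {
  // Hypothetical toy embeddings: (royalty, gender). Real models use hundreds of dimensions.
  val vectors: Map[String, Array[Double]] = Map(
    "king"  -> Array(1.0, 1.0),
    "queen" -> Array(1.0, -1.0),
    "man"   -> Array(0.0, 1.0),
    "woman" -> Array(0.0, -1.0),
    "apple" -> Array(-1.0, 0.1)
  )

  private def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Cosine similarity; defined as 0.0 when either vector has zero length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot / denom
  }

  // Adds the positive word vectors, subtracts the negative ones, and ranks the
  // remaining vocabulary by cosine similarity to the resulting query vector.
  // Unknown words throw NoSuchElementException in this sketch.
  def findAnalogies(
      positive: Seq[String],
      negative: Seq[String],
      num: Int): Seq[(String, Double)] = {
    require(num > 0, "Number of similar words should be > 0")
    require(positive.nonEmpty || negative.nonEmpty,
      "Either positive or negative argument must be supplied")

    val dim = vectors.values.head.length
    val query = Array.fill(dim)(0.0)
    for (w <- positive; i <- 0 until dim) query(i) += vectors(w)(i)
    for (w <- negative; i <- 0 until dim) query(i) -= vectors(w)(i)

    // Query words themselves are excluded from the results.
    val exclude = (positive ++ negative).toSet
    vectors.toSeq
      .filterNot { case (word, _) => exclude(word) }
      .map { case (word, vec) => (word, cosine(query, vec)) }
      .sortBy { case (_, sim) => -sim }
      .take(num)
  }
}
```

With these toy values, `AnalogySketch.findAnalogies(Seq("king", "woman"), Seq("man"), 1)` ranks "queen" first, because the query vector points in the same direction as the "queen" vector.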

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/benradford/spark feature/findAnalogies

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17234
    
----
commit 2e7f1a3bd519d79ce9b08d388247e9a1d7f67635
Author: Benjamin Radford <be...@gmail.com>
Date:   2017-03-10T04:42:33Z

    Added findAnalogies method to Word2VecModel

commit 9aefebfcd2e6eaad117727901ad70d0d26b03a1a
Author: Benjamin Radford <be...@gmail.com>
Date:   2017-03-10T05:16:46Z

    Fixed comment indentation to conform to style guide.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by benradford <gi...@git.apache.org>.
Github user benradford commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r123727450
  
    --- Diff: R/pkg/DESCRIPTION ---
    @@ -54,5 +54,5 @@ Collate:
         'types.R'
         'utils.R'
         'window.R'
    -RoxygenNote: 5.0.1
    +RoxygenNote: 6.0.1
    --- End diff --
    
    I remedied this manually.


[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17234
  
    Can one of the admins verify this patch?


[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...

Posted by benradford <gi...@git.apache.org>.
Github user benradford commented on the issue:

    https://github.com/apache/spark/pull/17234
  
    @srowen @felixcheung: I believe all concerns have been addressed. Please let me know if there are any remaining issues.


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r105564319
  
    --- Diff: R/pkg/DESCRIPTION ---
    @@ -54,5 +54,5 @@ Collate:
         'types.R'
         'utils.R'
         'window.R'
    -RoxygenNote: 5.0.1
    +RoxygenNote: 6.0.1
    --- End diff --
    
    This happens automatically when someone with a newer roxygen2 installed runs the R build.
    From looking at that code, there isn't an option to disable the automatic update of this string. I also don't think there is a way to pin the version, other than changing the R packages that are installed.


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r105351937
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -610,6 +610,71 @@ class Word2VecModel private[spark] (
       }
     
       /**
    +   * Find words similar to the words supplied to 'positive' and dissimilar
    +   * to the words supplied to 'negative'.
    +   * @param positive array of words similar to the results list
    +   * @param negative array of words dissimilar to the results list
    +   * @param num number of synonyms to find
    +   * @return array of (word, cosineSimilarity)
    +   */
    +  def findAnalogies(positive: Array[String] = Array(),
    +                            negative: Array[String] = Array(),
    +                            num: Int = 1): Array[(String, Double)] = {
    +    require(num > 0, "Number of similar words should be > 0")
    +    require(positive.length > 0 || negative.length > 0,
    +      "Either positive or negative argument must be supplied")
    +
    +    var positiveVectors = Array[Array[Double]]()
    +    var negativeVectors = Array[Array[Double]]()
    +
    +    for(pp <- positive)
    +      positiveVectors :+= transform(pp).toArray
    +    for(nn <- negative)
    +      negativeVectors :+= transform(nn).toArray
    +    // Normalize positive and negative vectors before summation
    +    positiveVectors = if (positiveVectors.size > 0) {
    +      positiveVectors.map(x => {
    +        val sumsqr = x.map(y => y * y).reduce((a, b) => a + b)
    +        x.map(y => y / math.pow(sumsqr, .5))
    +      })
    +    } else {
    +      Array(Array.fill(vectorSize)(0.0))
    +    }
    +    negativeVectors = if (negativeVectors.size > 0) {
    +      negativeVectors.map(x => {
    +        val sumsqr = x.map(y => y * y).reduce((a, b) => a + b)
    +        x.map(y => y / math.pow(sumsqr, .5))
    --- End diff --
    
    You just mean sqrt right? pow 0.5 is less readable
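
A minimal sketch of what the `sqrt` form of this normalization could look like. The `NormalizeSketch` object and the `l2Normalize` name are hypothetical and do not come from the patch. The zero-vector guard is also an added assumption: the code in the diff would divide by zero in that case.

```scala
object NormalizeSketch {
  // Scales a vector to unit L2 length using math.sqrt instead of math.pow(_, .5).
  // Assumption: a zero vector is returned as an unchanged copy rather than NaNs.
  def l2Normalize(v: Array[Double]): Array[Double] = {
    val norm = math.sqrt(v.map(y => y * y).sum)
    if (norm == 0.0) v.clone() else v.map(_ / norm)
  }
}
```

For example, `NormalizeSketch.l2Normalize(Array(3.0, 4.0))` yields a vector close to `(0.6, 0.8)`, since the L2 norm of the input is 5.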


[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...

Posted by benradford <gi...@git.apache.org>.
Github user benradford commented on the issue:

    https://github.com/apache/spark/pull/17234
  
    ok to test
    Jenkins, add to whitelist


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r105351821
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -610,6 +610,71 @@ class Word2VecModel private[spark] (
       }
     
       /**
    +   * Find words similar to the words supplied to 'positive' and dissimilar
    +   * to the words supplied to 'negative'.
    +   * @param positive array of words similar to the results list
    +   * @param negative array of words dissimilar to the results list
    +   * @param num number of synonyms to find
    +   * @return array of (word, cosineSimilarity)
    +   */
    +  def findAnalogies(positive: Array[String] = Array(),
    +                            negative: Array[String] = Array(),
    --- End diff --
    
    Nit: indentation. 


[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...

Posted by benradford <gi...@git.apache.org>.
Github user benradford commented on the issue:

    https://github.com/apache/spark/pull/17234
  
    @srowen I find this functionality useful and thought others might as well. It is a common use-case for word2vec and is suggested as a method for validating good model fit by Mikolov et al. I understand if you prefer not to include it in the core implementation, too. I did my best to address your suggestions in my latest commit. 


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r105351890
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -610,6 +610,71 @@ class Word2VecModel private[spark] (
       }
     
       /**
    +   * Find words similar to the words supplied to 'positive' and dissimilar
    +   * to the words supplied to 'negative'.
    +   * @param positive array of words similar to the results list
    +   * @param negative array of words dissimilar to the results list
    +   * @param num number of synonyms to find
    +   * @return array of (word, cosineSimilarity)
    +   */
    +  def findAnalogies(positive: Array[String] = Array(),
    +                            negative: Array[String] = Array(),
    +                            num: Int = 1): Array[(String, Double)] = {
    +    require(num > 0, "Number of similar words should be > 0")
    +    require(positive.length > 0 || negative.length > 0,
    +      "Either positive or negative argument must be supplied")
    +
    +    var positiveVectors = Array[Array[Double]]()
    +    var negativeVectors = Array[Array[Double]]()
    +
    +    for(pp <- positive)
    +      positiveVectors :+= transform(pp).toArray
    +    for(nn <- negative)
    +      negativeVectors :+= transform(nn).toArray
    +    // Normalize positive and negative vectors before summation
    +    positiveVectors = if (positiveVectors.size > 0) {
    +      positiveVectors.map(x => {
    --- End diff --
    
    There is a fair bit of duplicated code that could be factored into methods.
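
One possible way to factor out the duplication between the positive and negative branches is sketched below. The `AnalogyHelpers` object, `normalizedSum`, `queryVector`, and the `lookup` parameter (standing in for the `transform(word).toArray` call in the diff) are all hypothetical names chosen for illustration. This is not the refactoring that was actually committed.

```scala
object AnalogyHelpers {
  // Unit-length copy of v; a zero vector is returned unchanged (an added assumption).
  private def l2Normalize(v: Array[Double]): Array[Double] = {
    val norm = math.sqrt(v.map(y => y * y).sum)
    if (norm == 0.0) v.clone() else v.map(_ / norm)
  }

  // Sum of the normalized vectors for `words`; all zeros when `words` is empty.
  // The same helper serves both the positive and the negative word lists.
  def normalizedSum(
      words: Seq[String],
      lookup: String => Array[Double],
      vectorSize: Int): Array[Double] =
    words
      .map(w => l2Normalize(lookup(w)))
      .foldLeft(Array.fill(vectorSize)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (a, b) => a + b }
      }

  // Query vector: normalized positives minus normalized negatives.
  def queryVector(
      positive: Seq[String],
      negative: Seq[String],
      lookup: String => Array[Double],
      vectorSize: Int): Array[Double] = {
    val pos = normalizedSum(positive, lookup, vectorSize)
    val neg = normalizedSum(negative, lookup, vectorSize)
    pos.zip(neg).map { case (p, n) => p - n }
  }
}
```

Because an empty word list folds to the zero vector, the separate `if (… .size > 0) … else Array.fill(vectorSize)(0.0)` branches in the diff would no longer be needed under this arrangement.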


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r105351636
  
    --- Diff: R/pkg/DESCRIPTION ---
    @@ -54,5 +54,5 @@ Collate:
         'types.R'
         'utils.R'
         'window.R'
    -RoxygenNote: 5.0.1
    +RoxygenNote: 6.0.1
    --- End diff --
    
    This happens automatically. Ideally the build doesn't modify source files -- do you know any way to make it not do this?


[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17234#discussion_r105336208
  
    --- Diff: R/pkg/DESCRIPTION ---
    @@ -54,5 +54,5 @@ Collate:
         'types.R'
         'utils.R'
         'window.R'
    -RoxygenNote: 5.0.1
    +RoxygenNote: 6.0.1
    --- End diff --
    
    please revert this.

