You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MechCoder <gi...@git.apache.org> on 2015/04/11 11:50:26 UTC

[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

GitHub user MechCoder opened a pull request:

    https://github.com/apache/spark/pull/5467

    [SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls

    1. Use blas calls to find the dot product between two vectors.
    2. Prevent re-computing the L2 norm of the given vector for each word in model.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark spark-6065

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5467.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5467
    
----
commit 66cf62ab18d3752a4cc1cdba0dfb5f5034f398dc
Author: MechCoder <ma...@gmail.com>
Date:   2015-04-11T09:40:49Z

    [SPARK-6065] Optimize word2vec.findSynonynms using blas calls

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94078344
  
    @MechCoder I agree that supplying a Map is more intuitive.  How about we support:
    * Private constructor: Take Matrix
    * Public constructor: Take Map


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93969770
  
    @jkbradley I think I have addressed all your comments except the constructor.
    
    How about retaining the present Word2VecModel(Map: [String, Array(Float)]) and converting it internally to Word2VecModel(Map: [String, Int], Matrix) using something like
    
        this(Map: [String, Int], Matrix) = {
           // do map to indexedModel and Matrix conversion
          this(indexedModel: Map[String, Int], Matrix)
        }
    
    Supplying a Map[String, Int] seems much more intuitive from a user's point of view, 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92712446
  
      [Test build #30232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30232/consoleFull) for   PR 5467 at commit [`17210c3`](https://github.com/apache/spark/commit/17210c354488abc73adbcacbb485e2d402e5caa1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28450065
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
     class Word2VecModel private[mllib] (
         private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    --- End diff --
    
    It seems to me that `Map[String, Int]` will just be `model.zip.(0 until model.size).toMap`. Is it right to expect that the ordering of keys in the model Map does not change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94948494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30689/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94973222
  
      [Test build #30700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30700/consoleFull) for   PR 5467 at commit [`dd0b0b2`](https://github.com/apache/spark/commit/dd0b0b2c8d83a772cd9995dc89cece238f0f7ca0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94759896
  
      [Test build #30663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30663/consoleFull) for   PR 5467 at commit [`6b74c81`](https://github.com/apache/spark/commit/6b74c81451a8523babade4ed8b9067e64a061dae).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28811099
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,25 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  // Maintain a ordered list of words based on the index in the initial model.
    +  private val wordList: Array[String] = model.keys.toArray
    --- End diff --
    
    Where do I need to write this? As a comment?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28805370
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,25 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  // Maintain a ordered list of words based on the index in the initial model.
    +  private val wordList: Array[String] = model.keys.toArray
    +  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
    +  private val numDim = model.head._2.size
    --- End diff --
    
    Rename to vectorSize (to match name in Word2Vec)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94642552
  
    Yes, I'm sorry about that.  Please do push back if you think my advice is incorrect.  How difficult would it be to check out an earlier version from that point, and then look at the Github commit diffs for your later commits to check through updates you made later which might still apply to the old version?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93914210
  
      [Test build #30464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30464/consoleFull) for   PR 5467 at commit [`64575b0`](https://github.com/apache/spark/commit/64575b0282b350facc93340fbf653b38b0121b1a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28360430
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +487,17 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val fVector = vector.toArray
    +
    +    val cosineVec = new DenseVector(Array.fill[Double](numWords)(0))
    +    BLAS.gemv(1.0, wordVecMat, vector.asInstanceOf[DenseVector], 0.0, cosineVec)
    +
    +    // Need not divide with the norm of the given vector since it is constant.
    +    val updatedCosines = model.zipWithIndex.map { case (vec, ind) =>
    --- End diff --
    
    "vec" is not used


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93545241
  
    @jkbradley I've pushed some updates.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-91810391
  
      [Test build #30070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30070/consoleFull) for   PR 5467 at commit [`66cf62a`](https://github.com/apache/spark/commit/66cf62ab18d3752a4cc1cdba0dfb5f5034f398dc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94949893
  
    @jkbradley I've fixed up your comment. It makes sense any way, since now the entire model is iterated across only once.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5467


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r29032366
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +508,23 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    --- End diff --
    
    This TODO was created to use `BoundedPriorityQueue` to compute top k:
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/BoundedPriorityQueue.scala


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94951310
  
      [Test build #30700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30700/consoleFull) for   PR 5467 at commit [`dd0b0b2`](https://github.com/apache/spark/commit/dd0b0b2c8d83a772cd9995dc89cece238f0f7ca0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28632339
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +492,16 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val numWords = wordVectors.numRows
    +    val cosineVec = Vectors.zeros(numWords).asInstanceOf[DenseVector]
    +    BLAS.gemv(1.0, wordVectors, vector.asInstanceOf[DenseVector], 0.0, cosineVec)
    +
    +    // Need not divide with the norm of the given vector since it is constant.
    +    val updatedCosines = indexedModel.map { case (_, ind) =>
    --- End diff --
    
    I'm not sure if an iterator over a Map has guarantees about the ordering of the elements.  It's probably better not to assume that will remain stable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94727805
  
    I've pushed some updates.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92519784
  
    Yep, that's pretty much what I had in mind, except that I'd recommend:
    * using MLlib's local Matrix type (and its BLAS call in mllib.linalg.BLAS)
    * computing and storing the Matrix once during initialization
      * In this case, we should store a matching Array of Strings so that ```getVectors``` can be computed (rather than storing duplicate info)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28550188
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
     class Word2VecModel private[mllib] (
         private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    --- End diff --
    
    No, I don't think so.  I'll make an inline comment where it's used about handling that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28360427
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +487,17 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val fVector = vector.toArray
    +
    +    val cosineVec = new DenseVector(Array.fill[Double](numWords)(0))
    --- End diff --
    
    Use ```Vectors.zeros```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94917585
  
      [Test build #30689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30689/consoleFull) for   PR 5467 at commit [`ffc9240`](https://github.com/apache/spark/commit/ffc92402872f18ce730ab47edf7b341673d4d871).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93025072
  
    Those updates might require significant changes, so I'll make another pass after updates.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92679663
  
      [Test build #30232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30232/consoleFull) for   PR 5467 at commit [`17210c3`](https://github.com/apache/spark/commit/17210c354488abc73adbcacbb485e2d402e5caa1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94225128
  
    @MechCoder You would probably have to do the slicing (until MLlib's BLAS provides more of that functionality).  However, I think you could do the slicing using views so that it would not actually copy the data.  My main hope was that it would eliminate some extra code.  (Does it?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93570757
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30369/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93570742
  
      [Test build #30369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30369/consoleFull) for   PR 5467 at commit [`3b0d075`](https://github.com/apache/spark/commit/3b0d075be23aa3a59210c5eda1114aa342b47fb1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94534838
  
    @jkbradley I have problems in understanding how to write the code for this. I had this design in mind.
    
        class Word2VecModel private[mllib] (
            wordIndex: Map[String, Int],
            wordVectors: DenseMatrix) extends Serializable with Saveable {
      
            // Calculate wordVecNorms and wordList here
    
             private def this(model: Map[String, Array[Float]]) = {
                // convert to wordIndex and wordVectors here.
                this(wordIndex, wordVectors)
             }
    
    However, this does not work, since the first line after overriding the constructor should call the constructor itself. How to go about this?
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92594382
  
      [Test build #30217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30217/consoleFull) for   PR 5467 at commit [`a7237aa`](https://github.com/apache/spark/commit/a7237aabfdc4c8b64103725523f2dd7c7baf80e9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28805366
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,25 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  // Maintain a ordered list of words based on the index in the initial model.
    +  private val wordList: Array[String] = model.keys.toArray
    --- End diff --
    
    Since this is the first place that an ordering on keys is defined, can you use this below when creating wordVectors (to make sure the ordering is exactly the same)?
    
    Also, please add a little doc saying what each of these 6 values are.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94759921
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30663/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28550199
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,20 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  val indexedModel = model.keys.zip(0 until model.size).toMap
    --- End diff --
    
    State explicit types
    
    Rename: "wordIndex"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28550201
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,20 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  val indexedModel = model.keys.zip(0 until model.size).toMap
    +
    +  private val (wordVectors, wordVecNorms) = {
    --- End diff --
    
    State explicit types


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28360424
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +487,17 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val fVector = vector.toArray
    --- End diff --
    
    Use ```new DenseVector(vector.toArray)``` here to avoid cast below.  (Explicit constructor is clearer to the reader.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94729223
  
      [Test build #30663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30663/consoleFull) for   PR 5467 at commit [`6b74c81`](https://github.com/apache/spark/commit/6b74c81451a8523babade4ed8b9067e64a061dae).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-95910746
  
    cool, will make the changes along with sprak-7045


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94630413
  
    Alright, we can move those to another PR.
    
    > This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type. You would also need to use mllib.linalg.BLAS.nativeBLAS to make the BLAS calls.
    
    That was what I did initially :(


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94197937
  
    @jkbradley  Thinking over it again, I'm not sure if it would offer a great advantage to do so. If you are talking about preventing this slicing (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L408)?
    
    If yes, then even if I prevent this slicing and pass the Matrix, I would have to do the slicing again from `Matrix.values` to find the norms (https://github.com/apache/spark/pull/5467/files#diff-88f4b62c382b26ef8e856b23f5167ccdR446) .
    
    What other advantage did you have in mind?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r29032437
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,36 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  // wordList: Ordered list of words obtained from model.
    +  private val wordList: Array[String] = model.keys.toArray
    +
    +  // wordIndex: Maps each word to an index, which can retrieve the corresponding
    +  //            vector from wordVectors (see below).
    +  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
    --- End diff --
    
    `wordList.zipWithIndex.toMap`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93966926
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30477/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-95852537
  
    @MechCoder Sorry for my late comment! I made some minor comments. It would be good if you can submit a follow-up PR to address those issues. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-91828064
  
      [Test build #30070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30070/consoleFull) for   PR 5467 at commit [`66cf62a`](https://github.com/apache/spark/commit/66cf62ab18d3752a4cc1cdba0dfb5f5034f398dc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28360421
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
     class Word2VecModel private[mllib] (
         private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
     
    +  private val numDim = model.head._2.size
    --- End diff --
    
    Use "vectorSize" instead of "numDim" to mimic Word2Vec class.
    Better yet, move this and flatVec into a constructor or closure which sets up wordVecMat, wordVec (since this and flatVec are not used anywhere else)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92618394
  
      [Test build #30217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30217/consoleFull) for   PR 5467 at commit [`a7237aa`](https://github.com/apache/spark/commit/a7237aabfdc4c8b64103725523f2dd7c7baf80e9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r29032368
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +508,23 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    +
         val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +    val cosineVec = Array.fill[Float](numWords)(0)
    +    val alpha: Float = 1
    +    val beta: Float = 0
    +
    +    blas.sgemv(
    +      "T", vectorSize, numWords, alpha, wordVectors, vectorSize, fVector, 1, beta, cosineVec, 1)
    +
    +    // Need not divide with the norm of the given vector since it is constant.
    +    val updatedCosines = new Array[Double](numWords)
    --- End diff --
    
    Should reuse `cosineVec`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93942477
  
      [Test build #30464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30464/consoleFull) for   PR 5467 at commit [`64575b0`](https://github.com/apache/spark/commit/64575b0282b350facc93340fbf653b38b0121b1a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28573221
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +492,16 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val numWords = wordVectors.numRows
    +    val cosineVec = Vectors.zeros(numWords).asInstanceOf[DenseVector]
    +    BLAS.gemv(1.0, wordVectors, vector.asInstanceOf[DenseVector], 0.0, cosineVec)
    +
    +    // Need not divide with the norm of the given vector since it is constant.
    +    val updatedCosines = indexedModel.map { case (_, ind) =>
    --- End diff --
    
    Do you mean that when I do a `dict.map`, the ordering need not be the same as that of the dict?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93942492
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30464/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94948480
  
      [Test build #30689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30689/consoleFull) for   PR 5467 at commit [`ffc9240`](https://github.com/apache/spark/commit/ffc92402872f18ce730ab47edf7b341673d4d871).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28805372
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +497,23 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    +
         val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +    val cosineVec = Array.fill[Float](numWords)(0)
    +    val alpha: Float = 1
    +    val beta: Float = 1
    --- End diff --
    
    Use beta = 0 since we are setting cosineVec (not adding to it)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-91809293
  
    @jkbradley Was this what you had in mind?
    
    P.S: I prefer we finish off the other PR before discussion on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-95020437
  
    @jkbradley Could you open a jira for the TODO?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92618410
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30217/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94582162
  
    It just occurred to me that we're converting from Float to Double.  I'm not sure historically why Word2Vec used Float, but I'm worrying now about switching since it will double model sizes.  (I'm sorry I didn't think about this earlier!)
    
    This PR should still be doable, but you would need to store an Array[Float] instead of the Matrix type.  You would also need to use ```mllib.linalg.BLAS.nativeBLAS``` to make the BLAS calls.
    
    What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93840169
  
    @MechCoder 
    I just realized that Word2Vec already has a Matrix which could just be passed to Word2VecModel's constructor.  That might be easier and let you remove the code which converts to the map (in Word2Vec.fit) and converts back (in Word2VecModel).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92712486
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30232/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93545778
  
      [Test build #30369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30369/consoleFull) for   PR 5467 at commit [`3b0d075`](https://github.com/apache/spark/commit/3b0d075be23aa3a59210c5eda1114aa342b47fb1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28550205
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +492,16 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val numWords = wordVectors.numRows
    +    val cosineVec = Vectors.zeros(numWords).asInstanceOf[DenseVector]
    +    BLAS.gemv(1.0, wordVectors, vector.asInstanceOf[DenseVector], 0.0, cosineVec)
    --- End diff --
    
    Is the cast ```vector.asInstanceOf[DenseVector]``` safe?  I'd use ```new DenseVector(vector.toArray)``` instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-92679362
  
    I've addressed your comments. I did not use the blas calls from linalg.blas initially since I thought there might be some overhead due to preprocessing.
    
    This should be faster at least for repeated calls to `findSynonyms`, for obvious reasons.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94348677
  
    @MechCoder  There are 2 pieces of code which we're talking about: (1) converting the Map to a Matrix and (2) converting the Matrix to a Map.  I suppose that either approach would need both (assuming we supported the constructor taking a Map in either approach).
    
    I'm OK with leaving it as a to-do.  I think the performance optimization will be worthwhile at some point since I've heard of people complaining about Word2Vec models getting too large to fit into memory.  Keeping everything as a Matrix without an intermediate Map representation would save some space for large vocabs (and a bunch of object creation).
    
    Would you prefer to leave it as a to-do item?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28550207
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -479,9 +492,16 @@ class Word2VecModel private[mllib] (
        */
       def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
         require(num > 0, "Number of similar words should > 0")
    -    // TODO: optimize top-k
    -    val fVector = vector.toArray.map(_.toFloat)
    -    model.mapValues(vec => cosineSimilarity(fVector, vec))
    +
    +    val numWords = wordVectors.numRows
    +    val cosineVec = Vectors.zeros(numWords).asInstanceOf[DenseVector]
    +    BLAS.gemv(1.0, wordVectors, vector.asInstanceOf[DenseVector], 0.0, cosineVec)
    +
    +    // Need not divide with the norm of the given vector since it is constant.
    +    val updatedCosines = indexedModel.map { case (_, ind) =>
    --- End diff --
    
    Here's where you're assuming something about the ordering of Map items.  Instead of this, I'd recommend just operating on cosineVec and wordVecNorms.  After finding the top word indices, you can look them up in the wordIndex (indexedModel).  I'd also use a while loop instead of a map since it should be faster.
    
    I also wonder if we should store the word index in both directions (since we need both in the 2 findSynonyms methods):
    * ```wordList: Array[String]``` (words ordered by indices)
    * ```wordIndex: Map[String, Int]``` (map from word to index)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28360417
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -431,6 +431,14 @@ class Word2Vec extends Serializable with Logging {
     class Word2VecModel private[mllib] (
         private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    --- End diff --
    
    Don't store this any more.  Store wordVecMat, plus a matching collection of words; that collection should probably be a Map[String, Int] mapping word to index in wordVecMat (so that transform() is still fast).
    
    Naming: How about "wordVectors" instead of "wordVecMat"?
    
    You'll need to update getVectors to construct the map.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93966910
  
      [Test build #30477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30477/consoleFull) for   PR 5467 at commit [`da1642d`](https://github.com/apache/spark/commit/da1642deb67dde65bb55b08ae47bd5ce0d29d545).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5467#discussion_r28813287
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -429,7 +429,25 @@ class Word2Vec extends Serializable with Logging {
      */
     @Experimental
     class Word2VecModel private[mllib] (
    -    private val model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +    model: Map[String, Array[Float]]) extends Serializable with Saveable {
    +
    +  // Maintain a ordered list of words based on the index in the initial model.
    +  private val wordList: Array[String] = model.keys.toArray
    --- End diff --
    
    Yes, please, a little one-line comment for each value would be fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94973251
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30700/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94975553
  
    LGTM, merging into master.  Thanks very much!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94917250
  
    @jkbradley fixed, hopefully should be it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94891481
  
    @MechCoder I think that's it.  Thanks very much for updating & putting up with the re-do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94233690
  
    On the contrary it would add more code, since I would have to support both cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-91828077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30070/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-93951771
  
      [Test build #30477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30477/consoleFull) for   PR 5467 at commit [`da1642d`](https://github.com/apache/spark/commit/da1642deb67dde65bb55b08ae47bd5ce0d29d545).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94642777
  
    yes, on it. by the way, it would be great if you could give me some advice on this PR, https://github.com/apache/spark/pull/5455 I'm not sure how to proceed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6065] [MLlib] Optimize word2vec.findSyn...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5467#issuecomment-94581278
  
    @MechCoder  That looks correct, except you'll need to call this() immediately.  I'd write helper methods for constructing wordIndex and wordVectors:
    ```
      private def this(model: Map[String, Array[Float]]) = this(buildWordIndex(model), buildWordVectors(model))
    ```
    
    You'll need to add unit tests and stuff too.  Would you actually mind if we made this another JIRA and PR?  I'm starting to worry about mission creep.  : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org