Posted to reviews@spark.apache.org by Ishiihara <gi...@git.apache.org> on 2014/08/14 01:25:41 UTC

[GitHub] spark pull request: [SPARK-2907][MLlib] Word2Vec performance impro...

GitHub user Ishiihara opened a pull request:

    https://github.com/apache/spark/pull/1932

    [SPARK-2907][MLlib] Word2Vec performance improve

    @mengxr Please review the code. I will add weights in reduceByKey soon.
    
    Before merging, each partition outputs model entries only for the words that appeared in that partition, and reduceByKey combines the partial models. In general, this implementation is about 30s faster than the implementation that uses one big array.
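
The merge described above can be sketched without Spark. This is a minimal illustration, not the code in the patch: plain Scala collections stand in for the RDD, `groupBy` plus a reduce plays the role of `reduceByKey`, and the names `SparseMergeSketch`, `emitTouched`, and `mergeByKey` are hypothetical. The sum is unweighted, since the description says weights are still to be added.

```scala
object SparseMergeSketch {
  // Emit (row index -> row vector) only for rows this partition touched.
  // `touched(i) != 0` mirrors the role of the synModify counter in the patch.
  def emitTouched(
      syn: Array[Float],
      touched: Array[Int],
      vectorSize: Int): Seq[(Int, Array[Float])] = {
    touched.indices
      .filter(i => touched(i) != 0)
      .map(i => (i, syn.slice(i * vectorSize, (i + 1) * vectorSize)))
  }

  // Stand-in for reduceByKey: element-wise sum of vectors sharing a row index.
  def mergeByKey(parts: Seq[Seq[(Int, Array[Float])]]): Map[Int, Array[Float]] = {
    parts.flatten.groupBy(_._1).map { case (index, rows) =>
      val sum = rows.map(_._2).reduce { (v1, v2) =>
        v1.zip(v2).map { case (a, b) => a + b }
      }
      (index, sum)
    }
  }
}
```

Rows that no partition touched never enter the merge, which is where the saving over shipping one full array per partition would come from.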

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ishiihara/spark Word2Vec-improve2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1932
    
----
commit aa2ab36c6a9fa22e759ccb99352394dd6d6317e0
Author: Liquan Pei <li...@gmail.com>
Date:   2014-08-13T12:45:17Z

    use reduceByKey to combine models

commit 9075e1cba5ae64add2986514be99dc51083ff177
Author: Liquan Pei <li...@gmail.com>
Date:   2014-08-13T23:17:58Z

    combine syn0Global and syn1Global to synGlobal

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52452252
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18732/consoleFull) for   PR 1932 at commit [`d5377a9`](https://github.com/apache/spark/commit/d5377a9ea607d015fce4a2ac7eebdb467db5f46f).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3097][MLlib] Word2Vec performance impro...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1932


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52220608
  
    QA results for PR 1932:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18547/consoleFull


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16222200
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging {
         
         val newSentences = sentences.repartition(numPartitions).cache()
         val initRandom = new XORShiftRandom(seed)
    -    var syn0Global =
    -      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
    -    var syn1Global = new Array[Float](vocabSize * vectorSize)
    -
    +    var synGlobal =
    --- End diff --
    
    There is no slicing across both syn0 and syn1, right?


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16221788
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging {
         
         val newSentences = sentences.repartition(numPartitions).cache()
         val initRandom = new XORShiftRandom(seed)
    -    var syn0Global =
    -      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
    -    var syn1Global = new Array[Float](vocabSize * vectorSize)
    -
    +    var synGlobal =
    --- End diff --
    
    Do we want to keep `syn0` and `syn1` in order to have an easy mapping from/to the original C implementation? It reduces the code maintenance cost.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52408585
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18683/consoleFull) for   PR 1932 at commit [`cad2011`](https://github.com/apache/spark/commit/cad201140970df237fc5492691c774e4d2d83763).
     * This patch merges cleanly.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52449340
  
    Jenkins, test this please.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16221793
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -321,42 +320,46 @@ class Word2Vec extends Serializable with Logging {
                         // Hierarchical softmax
                         var d = 0
                         while (d < bcVocab.value(word).codeLen) {
    -                      val l2 = bcVocab.value(word).point(d) * vectorSize
    +                      val ind = bcVocab.value(word).point(d)
    +                      val l2 = ind * vectorSize
                           // Propagate hidden -> output
    -                      var f = blas.sdot(vectorSize, syn0, l1, 1, syn1, l2, 1)
    +                      synModify(ind) += 1
    +                      var f = blas.sdot(vectorSize, syn, l1, 1, syn, l2, 1)
                           if (f > -MAX_EXP && f < MAX_EXP) {
                             val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toInt
                             f = expTable.value(ind)
                             val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat
    -                        blas.saxpy(vectorSize, g, syn1, l2, 1, neu1e, 0, 1)
    -                        blas.saxpy(vectorSize, g, syn0, l1, 1, syn1, l2, 1)
    +                        blas.saxpy(vectorSize, g, syn, l2, 1, neu1e, 0, 1)
    +                        blas.saxpy(vectorSize, g, syn, l1, 1, syn, l2, 1)
                           }
                           d += 1
                         }
    -                    blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn0, l1, 1)
    +                    blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn, l1, 1)
    +                    synModify(lastWord) += 1
                       }
                     }
                     a += 1
                   }
                   pos += 1
                 }
    -            (syn0, syn1, lwc, wc)
    +            (syn, lwc, wc)
             }
    -        Iterator(model)
    -      }
    -      val (aggSyn0, aggSyn1, _, _) =
    -        partial.treeReduce { case ((syn0_1, syn1_1, lwc_1, wc_1), (syn0_2, syn1_2, lwc_2, wc_2)) =>
    -          val n = syn0_1.length
    -          val weight1 = 1.0f * wc_1 / (wc_1 + wc_2)
    -          val weight2 = 1.0f * wc_2 / (wc_1 + wc_2)
    -          blas.sscal(n, weight1, syn0_1, 1)
    -          blas.sscal(n, weight1, syn1_1, 1)
    -          blas.saxpy(n, weight2, syn0_2, 1, syn0_1, 1)
    -          blas.saxpy(n, weight2, syn1_2, 1, syn1_1, 1)
    -          (syn0_1, syn1_1, lwc_1 + lwc_2, wc_1 + wc_2)
    +        val synLocal = model._1
    +        val synOut = new PrimitiveKeyOpenHashMap[Int, Array[Float]](vocabSize * 2)
    +        var index = 0
    +        while(index < 2 * vocabSize) {
    +          if (synModify(index) != 0) {
    +            synOut.update(index, synLocal.slice(index * vectorSize, (index + 1) * vectorSize))
    +          }
    +          index += 1
             }
    -      syn0Global = aggSyn0
    -      syn1Global = aggSyn1
    +        Iterator(synOut)
    +      }
    +      synGlobal = partial.flatMap(x => x).reduceByKey {
    +        case (v1,v2) => 
    +          blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1)   
    +          v1
    +      }.collect().sortBy(_._1).flatMap(x => x._2)
    --- End diff --
    
    Updating `synGlobal` in place is more memory-efficient: we don't need to allocate new storage or sort.
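
One way to read this suggestion, sketched under assumptions: after `collect()`, each `(index, vector)` pair is copied straight into its slot of the existing array, so arrival order does not matter. `InPlaceUpdateSketch` and `applyUpdates` are hypothetical names, not code from the patch.

```scala
object InPlaceUpdateSketch {
  // Copy each collected row into its slot of the existing global array.
  // Rows absent from `updates` keep their previous values.
  def applyUpdates(
      synGlobal: Array[Float],
      vectorSize: Int,
      updates: Seq[(Int, Array[Float])]): Unit = {
    updates.foreach { case (index, row) =>
      require(row.length == vectorSize, s"row $index has wrong length")
      System.arraycopy(row, 0, synGlobal, index * vectorSize, vectorSize)
    }
  }
}
```

A side effect of writing by index is that rows missing from the collected result stay where they are, whereas rebuilding the array with `sortBy(...).flatMap(...)` yields only the rows that were present.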


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52409509
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18683/consoleFull) for   PR 1932 at commit [`cad2011`](https://github.com/apache/spark/commit/cad201140970df237fc5492691c774e4d2d83763).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16222127
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging {
         
         val newSentences = sentences.repartition(numPartitions).cache()
         val initRandom = new XORShiftRandom(seed)
    -    var syn0Global =
    -      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
    -    var syn1Global = new Array[Float](vocabSize * vectorSize)
    -
    +    var synGlobal =
    --- End diff --
    
    We can keep syn0 and syn1, but that adds some unnecessary slicing operations on the array.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16223658
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging {
         
         val newSentences = sentences.repartition(numPartitions).cache()
         val initRandom = new XORShiftRandom(seed)
    -    var syn0Global =
    -      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
    -    var syn1Global = new Array[Float](vocabSize * vectorSize)
    -
    +    var synGlobal =
    --- End diff --
    
    How about making a composite key in the output RDD so that reduceByKey can distinguish whether the update is for syn0 or syn1?
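
A composite key could take several forms; the sketch below assumes a `(layer, index)` tuple, with layer 0 for syn0 and 1 for syn1. That encoding, and the names `CompositeKeySketch` and `sumByKey`, are hypothetical, and `groupBy` on plain collections stands in for `reduceByKey`.

```scala
object CompositeKeySketch {
  // Layer tag 0 marks a syn0 row, 1 marks a syn1 row (a hypothetical encoding).
  type Key = (Int, Int)

  // Stand-in for reduceByKey on (layer, index) keys: element-wise sum per key.
  def sumByKey(entries: Seq[(Key, Array[Float])]): Map[Key, Array[Float]] =
    entries.groupBy(_._1).map { case (key, rows) =>
      (key, rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y }))
    }
}
```

With a tagged key, updates to row 5 of syn0 and row 5 of syn1 land in separate groups even though they share an index.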


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16221784
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -34,7 +34,7 @@ import org.apache.spark.mllib.rdd.RDDFunctions._
     import org.apache.spark.rdd._
     import org.apache.spark.util.Utils
     import org.apache.spark.util.random.XORShiftRandom
    -
    +import org.apache.spark.util.collection.PrimitiveKeyOpenHashMap
     /**
    --- End diff --
    
    Add an empty line after the imports.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16222465
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging {
         
         val newSentences = sentences.repartition(numPartitions).cache()
         val initRandom = new XORShiftRandom(seed)
    -    var syn0Global =
    -      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
    -    var syn1Global = new Array[Float](vocabSize * vectorSize)
    -
    +    var synGlobal =
    --- End diff --
    
    We need to perform reduceByKey on both syn0 and syn1, and the sets of updated keys differ between them. To reduce syn0 and syn1 together, each entry needs a unique key; one way to achieve this is to use i + vocabSize as the key for syn1(i). After we collect, we then need to slice the result to update syn0Global and syn1Global. Any better ideas?
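
The key scheme described here can be sketched as follows, assuming syn0 and syn1 each have `vocabSize` rows. `OffsetKeySketch`, `syn0Key`, `syn1Key`, and `splitUpdates` are hypothetical names used only for illustration.

```scala
object OffsetKeySketch {
  // Row i of syn0 keeps key i; row i of syn1 gets key i + vocabSize.
  def syn0Key(i: Int): Int = i
  def syn1Key(i: Int, vocabSize: Int): Int = i + vocabSize

  // After collecting, split merged entries back into syn0 and syn1 updates,
  // shifting syn1 keys back to their original row indices.
  def splitUpdates(
      merged: Seq[(Int, Array[Float])],
      vocabSize: Int): (Seq[(Int, Array[Float])], Seq[(Int, Array[Float])]) = {
    val (syn0Part, syn1Part) = merged.partition { case (key, _) => key < vocabSize }
    (syn0Part, syn1Part.map { case (key, row) => (key - vocabSize, row) })
  }
}
```

Because every key is unique across both matrices, a single reduce pass can handle both, at the cost of the split step afterwards.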


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52144922
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-3097][MLlib] Word2Vec performance impro...

Posted by loveconan1988 <gi...@git.apache.org>.

Github user loveconan1988 commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52456557
  
    ------------------ Original Message ------------------
    From: "asfgit";<no...@github.com>;
    Sent: Monday, August 18, 2014, 2:31 PM
    To: "apache/spark"<sp...@noreply.github.com>;

    Subject: Re: [spark] [SPARK-3097][MLlib] Word2Vec performance improvement(#1932)

    Closed #1932 via 3c8fa50.

    --
    Reply to this email directly or view it on GitHub.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1932#discussion_r16221791
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -321,42 +320,46 @@ class Word2Vec extends Serializable with Logging {
                         // Hierarchical softmax
                         var d = 0
                         while (d < bcVocab.value(word).codeLen) {
    -                      val l2 = bcVocab.value(word).point(d) * vectorSize
    +                      val ind = bcVocab.value(word).point(d)
    +                      val l2 = ind * vectorSize
                           // Propagate hidden -> output
    -                      var f = blas.sdot(vectorSize, syn0, l1, 1, syn1, l2, 1)
    +                      synModify(ind) += 1
    +                      var f = blas.sdot(vectorSize, syn, l1, 1, syn, l2, 1)
                           if (f > -MAX_EXP && f < MAX_EXP) {
                             val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toInt
                             f = expTable.value(ind)
                             val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat
    -                        blas.saxpy(vectorSize, g, syn1, l2, 1, neu1e, 0, 1)
    -                        blas.saxpy(vectorSize, g, syn0, l1, 1, syn1, l2, 1)
    +                        blas.saxpy(vectorSize, g, syn, l2, 1, neu1e, 0, 1)
    +                        blas.saxpy(vectorSize, g, syn, l1, 1, syn, l2, 1)
                           }
                           d += 1
                         }
    -                    blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn0, l1, 1)
    +                    blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn, l1, 1)
    +                    synModify(lastWord) += 1
                       }
                     }
                     a += 1
                   }
                   pos += 1
                 }
    -            (syn0, syn1, lwc, wc)
    +            (syn, lwc, wc)
             }
    -        Iterator(model)
    -      }
    -      val (aggSyn0, aggSyn1, _, _) =
    -        partial.treeReduce { case ((syn0_1, syn1_1, lwc_1, wc_1), (syn0_2, syn1_2, lwc_2, wc_2)) =>
    -          val n = syn0_1.length
    -          val weight1 = 1.0f * wc_1 / (wc_1 + wc_2)
    -          val weight2 = 1.0f * wc_2 / (wc_1 + wc_2)
    -          blas.sscal(n, weight1, syn0_1, 1)
    -          blas.sscal(n, weight1, syn1_1, 1)
    -          blas.saxpy(n, weight2, syn0_2, 1, syn0_1, 1)
    -          blas.saxpy(n, weight2, syn1_2, 1, syn1_1, 1)
    -          (syn0_1, syn1_1, lwc_1 + lwc_2, wc_1 + wc_2)
    +        val synLocal = model._1
    +        val synOut = new PrimitiveKeyOpenHashMap[Int, Array[Float]](vocabSize * 2)
    +        var index = 0
    +        while(index < 2 * vocabSize) {
    +          if (synModify(index) != 0) {
    +            synOut.update(index, synLocal.slice(index * vectorSize, (index + 1) * vectorSize))
    +          }
    +          index += 1
             }
    -      syn0Global = aggSyn0
    -      syn1Global = aggSyn1
    +        Iterator(synOut)
    +      }
    +      synGlobal = partial.flatMap(x => x).reduceByKey {
    +        case (v1,v2) => 
    --- End diff --
    
    move `case (v1, v2) =>` to previous line and add a space after `,`


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52449489
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18732/consoleFull) for   PR 1932 at commit [`d5377a9`](https://github.com/apache/spark/commit/d5377a9ea607d015fce4a2ac7eebdb467db5f46f).
     * This patch merges cleanly.


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52214890
  
    QA tests have started for PR 1932. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18547/consoleFull


[GitHub] spark pull request: [MLlib] Word2Vec performance improvement

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52408516
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-3097][MLlib] Word2Vec performance impro...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1932#issuecomment-52456438
  
    LGTM. Merged into master and branch-1.1. Thanks!

