Posted to reviews@spark.apache.org by ygcao <gi...@git.apache.org> on 2015/12/05 01:39:07 UTC

[GitHub] spark pull request: add support of arbitrary length sentence by us...

GitHub user ygcao opened a pull request:

    https://github.com/apache/spark/pull/10152

    add support for arbitrary-length sentences by using the natural representation of sentences in the input, and other minor tunings

    add support for arbitrary-length sentences by using the natural representation of sentences in the input.
    
    add new similarity functions and add a normalization option for distances in synonym finding
    add a new accessor for internal structures (the vocabulary and word index) for convenience
    
    jira link: https://issues.apache.org/jira/browse/SPARK-12153

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ygcao/spark improvementForSentenceBoundary

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10152
    
----
commit dab8425114696790bb254350b6c528852cf978c0
Author: Yong Gang Cao <yg...@amazon.com>
Date:   2015-12-05T00:00:48Z

    add support for arbitrary-length sentences by using the natural representation of sentences in the input.
    add new similarity functions and add a normalization option for distances in synonym finding
    add a new accessor for internal structures (the vocabulary and word index) for convenience

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r50683565
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    I also looked at a roughly similar approach. I assumed the initial formulation was done for efficiency reasons (the while loop, etc.) - which is why I didn't suggest it or explore it much further - but we'd have to get the original author to weigh in on that.
    
    I agree your approach is simpler and more succinct.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-173621659
  
    **[Test build #2431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2431/consoleFull)** for PR 10152 at commit [`e938208`](https://github.com/apache/spark/commit/e938208d9c85515f62b41635a8445b8ab31f55f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-177855098
  
    FYI: I ran a small-scale perf test and also checked the logic for correctness. Please see my comment on the review diff for details of the experiment.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47822243
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,17 +295,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    +          // do cutting only when the sentence is larger than maxSentenceLength
    +          if ((wordIter == null || !wordIter.hasNext) && sentenceIter.hasNext) {
    +            do {
    --- End diff --
    
    This is just for the edge case where the input contains empty sentences: the do-while loop skips them instead of generating an empty array for each empty sentence.
    Of course, if the sentence splitter skips empty sentences this is unnecessary, but that's outside this class's control. So I recommend keeping the loop; it won't introduce much latency, since the condition check is cheap and the loop runs only once in 99.999% of cases.
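
    For illustration, a minimal standalone Scala sketch of that skipping behaviour (made-up data, not the PR's exact code):
    
        val sentenceIter = Iterator(Seq.empty[String], Seq.empty[String], Seq("c"))
        var wordIter: Iterator[String] = Iterator.empty
        do {
          wordIter = sentenceIter.next().iterator
        } while (!wordIter.hasNext && sentenceIter.hasNext) // skip empty sentences
        wordIter.toList // List("c") -- no empty Array is ever emitted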




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52530869
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    This `if ... else` is not necessary. `Seq.empty.grouped` returns an empty iterator.
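
    For reference, a quick Scala check of that behaviour (illustrative values only):
    
        Seq.empty[Int].grouped(3).toList     // List() -- empty iterator; flatMap emits nothing
        Seq(1, 2, 3, 4, 5).grouped(3).toList // List(List(1, 2, 3), List(4, 5))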




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-172489168
  
    **[Test build #2394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2394/consoleFull)** for PR 10152 at commit [`76e8266`](https://github.com/apache/spark/commit/76e82667bd04bc359a336c7538d2595aade8384e).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-172468729
  
    Ouch, we finally decided to make a backward-incompatible change for synonyms. That caused a test-case failure.
    I adjusted the expected value according to our new logic: always return a normalized similarity value from the findSynonyms function.
    Also added defensive code for a potential divide-by-zero, following @srowen 's hot fix of another similar situation.
    I'm 90% confident these changes are enough to pass the tests.
    The reason I can't be 100% confident is that Spark SQL can't compile and the run_tests script isn't working in my local env. I worked out the expected value by hand, but verified it against the output before and after the change.
    FYI: the compile issue on the master branch is in Spark SQL, not related to our changes, but it's a blocker somebody should look into:
    [error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala:311: reference to Rollup is ambiguous;
    [error] it is imported twice in the same scope by
    [error] import org.apache.spark.sql.catalyst.plans.logical._
    [error] and import org.apache.spark.sql.catalyst.expressions._
    [error]                     Seq(Rollup(children.map(nodeToExpr))),
    [error]                         ^
    [error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala:319: reference to Cube is ambiguous;
    [error] it is imported twice in the same scope by
    [error] import org.apache.spark.sql.catalyst.plans.logical._
    [error] and import org.apache.spark.sql.catalyst.expressions._
    [error]                     Seq(Cube(children.map(nodeToExpr))),
    [error]                         ^
    [error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:191: reference to Rollup is ambiguous;
    [error] it is imported twice in the same scope by
    [error] import org.apache.spark.sql.catalyst.plans.logical._
    [error] and import org.apache.spark.sql.catalyst.expressions._
    [error]     def bitmasks(r: Rollup): Seq[Int] = {
    [error]                     ^
    [error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:204: reference to Cube is ambiguous;
    [error] it is imported twice in the same scope by
    [error] import org.apache.spark.sql.catalyst.plans.logical._
    [error] and import org.apache.spark.sql.catalyst.expressions._
    [error]     def bitmasks(c: Cube): Seq[Int] = {
    [error]                     ^
    [error] spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:210: reference to Cube is ambiguous;
    [error] it is imported twice in the same scope by
    [error] import org.apache.spark.sql.catalyst.plans.logical._
    [error] and import org.apache.spark.sql.catalyst.expressions._
    [error]       case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
    [error]    





[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-166215753
  
    Pinging @jkbradley @mengxr @MechCoder again for a final review - could you give this a look and confirm you're in agreement with my comments above?
    
    Also, thoughts on whether this should target `1.6.1`, as it is actually a fairly major yet subtle bug in the implementation? Or should it even be backported to `1.5.3`?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52573698
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    Will an empty iterator make flatMap skip it, just like skipping None?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51088952
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    Agreed, that is more succinct. @ygcao can you look into making that change? Thanks!




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-181240271
  
    Added braces to make lint happy. Jenkins should be happy now.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-187119717
  
    @srowen thanks for merging it.
    
    @ygcao thanks for the PR!




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-187098446
  
    Merged to master




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46763811
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
    --- End diff --
    
    This is the essential change; even making the max configurable would not be a reasonable fix on its own.
    As I mentioned in the JIRA issue:
    sentence boundaries matter for the sliding window - we shouldn't train the model on a window that crosses sentences. The current 1000-word hard split for sentences doesn't really make sense;
    any hard-coded setting is a poor choice.
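
    To illustrate (a toy sketch in plain Scala with made-up words and a hypothetical contextPairs helper - not the MLlib code):
    
        // Skip-gram training pairs from a sliding window.
        def contextPairs(words: Seq[String], window: Int): Seq[(String, String)] =
          words.zipWithIndex.flatMap { case (w, i) =>
            val ctx = words.slice(math.max(0, i - window), i) ++
              words.slice(i + 1, i + 1 + window)
            ctx.map(c => (w, c))
          }
    
        val flattened = Seq("the", "sky", "is", "blue", "dogs", "bark")
        contextPairs(flattened, 1)
        // includes ("blue","dogs") and ("dogs","blue") -- cross-sentence noise
    
        Seq(Seq("the", "sky", "is", "blue"), Seq("dogs", "bark"))
          .flatMap(contextPairs(_, 1)) // boundaries kept: no cross-sentence pairs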





[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47428877
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,19 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * set the maxSentenceLength for cutting purpose
    --- End diff --
    
    Please match the style of this comment to the other parameter setters (e.g. https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL129), including the default parameter setting.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-163512458
  
    I have to say word2vec (skip-gram) can absolutely be affected by the training data, just like any other ML algorithm. I can also tell you I observed big differences when I applied different massaging to the data at the scale of millions to billions of sentences.
    Researchers often simplify engineering details; one obvious example is that phrase2vec is highly simplified to show the algorithm's effectiveness rather than relying on a complex entity-recognition engine. That doesn't mean we shouldn't do more advanced phrase construction for training.
    The benefit of using sentence boundaries is quite intuitive when you think in terms of the expected output of skip-gram and how backpropagation works. During training, the beginning words of the next sentence are garbage input as context words for the words at the tail of the current sentence. Conversely, in the original version, the words around the fixed-size cutting points lose semantically meaningful context words. Words at sentence endings or around cutting points are a minority, so you may not notice a huge impact from the hard-cut version, but that's no reason not to improve it further. Again, model building is absolutely sensitive to the input data; we humans just aren't sensitive to issues caused by a minority of cases without a deep dive, but the good thing is we still have theory to reason with. What harm could respecting sentence boundaries bring?





[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162800260
  
    @jkbradley as far as I can see:
    
    The input to `fit` is `dataset: RDD[Iterable[String]]`. However, in [L273](https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL273), we create `val words = dataset.flatMap(x => x)`, flattening out the RDD of sentences to an RDD of words, i.e. `RDD[Iterable[String]] -> RDD[String]`.
    
    Then in [L285](https://github.com/apache/spark/pull/10152/files#diff-88f4b62c382b26ef8e856b23f5167ccdL285), `words` is used in `mapPartitions` (as opposed to `dataset`). In the `mapPartitions` block, `next` advances through the iterator (which is now the flatMapped stream of words, *not* a stream of sentences), applying a hard cutoff to create each sentence `Array[Int]` once `MAX_SENTENCE_LENGTH` is reached.
    
    So the way I read the current code, it indeed does just treat the input as a stream of words, discards sentence boundaries, and uses a hard 1000 word limit on "sentences". Let me know if I missed something here.
    
    This in fact matches what the Google impl does (from my quick look through [the C code](http://word2vec.googlecode.com/svn/trunk/word2vec.c), e.g. L373-405 and L70 in `ReadWord`).
    
    So purely technically the current code is "correct" as it matches the original, but I'm not sure if it was intentional or not to use `words` in the `mapPartitions` block. But to me it's an open question whether this approach, or keeping sentence structure / boundaries, is better. 
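
    A toy sketch of the effect (plain Scala on made-up data, standing in for the RDD operations):
    
        val dataset = Seq(Seq("the", "sky", "is", "blue"), Seq("dogs", "bark"))
        val words = dataset.flatMap(x => x) // sentence boundaries discarded
        // words == Seq("the", "sky", "is", "blue", "dogs", "bark"); the code
        // then re-cuts this flat stream into "sentences" of up to
        // MAX_SENTENCE_LENGTH (1000) words, so the real boundaries never
        // influence the training windows.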




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46923499
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +588,13 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm)
    +      vecNorm = blas.snrm2(vectorSize, fVector, 1)
    --- End diff --
    
    Glad to see another pull request demanded the same normalization. I think my change is more backward compatible and leaves the user the choice between speed and normalized metrics for further operations.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-165959134
  
    Thanks @MLnick! Added the @Since annotation in the code accordingly, and updated the issue to set its type to bug and keep it consistent with what we finalized for the change.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r48229609
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * Sets the maximum length of each sentence in the input data.
    +   * Any sentence longer than this threshold will be divided into chunks of
    +   * up to `maxSentenceLength` size (default: 1000)
    +   */
    +  @Since("1.6.1")
    --- End diff --
    
    @ygcao actually could we target `2.0.0` instead of `1.6.1` in this PR, since it's against master? We'll consider backporting after it is merged.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-179858977
  
    @ygcao sorry for the delay. I'm trying to run a few `spark-perf` tests, and a larger scale run if possible. Will report back ASAP.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-182317367
  
    Agree, I'm ready to merge this. I'll CC @mengxr or @jkbradley in case they want a final comment today




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52530851
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -76,6 +76,18 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * Sets the maximum length of each sentence in the input data.
    +   * Any sentence longer than this threshold will be divided into chunks of
    +   * up to `maxSentenceLength` size (default: 1000)
    +   */
    +  @Since("2.0.0")
    +  def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
    --- End diff --
    
    It is not clear from the doc what "sentence length" means - the number of words or the number of characters. We can either update the doc or change the param name to `maxWordsPerSentence` to make this clear from the name.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-180820176
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-181271212
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52142152
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -551,12 +553,17 @@ class Word2VecModel private[spark] (
           }
           ind += 1
         }
    -    wordList.zip(cosVec)
    +    var topResults = wordList.zip(cosVec)
           .toSeq
    -      .sortBy(- _._2)
    +      .sortBy(-_._2)
           .take(num + 1)
           .tail
    -      .toArray
    +    if (vecNorm != 0.0f) {
    +      topResults = topResults.map {
    +        case (word: String, cosVec: Double) => (word, cosVec / vecNorm)
    --- End diff --
    
    I also generally write it that way; you don't need the types here in the case match statement.
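
    That is, both forms compile identically here since the tuple's element types are already known (a sketch using the names from the diff above):
    
        topResults.map { case (word: String, cosVec: Double) => (word, cosVec / vecNorm) }
        topResults.map { case (word, cosVec) => (word, cosVec / vecNorm) } // equivalent, terser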




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46806287
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -555,6 +614,14 @@ class Word2VecModel private[spark] (
           (word, wordVectors.slice(vectorSize * ind, vectorSize * ind + vectorSize))
         }
       }
    +
    +  /**
    +   * get word vector array. length is vocabularySize*vectorSize
    +   * @return the array of word vectors, need to split it by vectorSize to get each individual vector
    +   */
    +  def getWordVectors: Array[Float] = {
    --- End diff --
    
    Agree, don't see the value in exposing this when `getVectors` is there already.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162503043
  
    @ygcao can we remove the code changes related to `cosineSimilarity` and `findSynonyms`? This will make it cleaner so we can focus on the core issue (see my next comment).
    
    Also, I think that things like distance metrics should live in a centralized place, and so changes to these are outside the scope of this particular model and the PR. In fact, the original `cosineSimilarity` method (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L472) is no longer used, as it's been replaced by a more efficient version, so we should consider deprecating it (or indeed just removing it as it's private and always has been).




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-181270729
  
    **[Test build #50915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50915/consoleFull)** for PR 10152 at commit [`84a0bc4`](https://github.com/apache/spark/commit/84a0bc4c73ca0915f6e6291442c8e60f6d80d879).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-174452391
  
    As Sean says, this hasn't been merged yet; I was waiting for the latest test build to pass before making a final pass over this.
    
    Ideally I'd like to just get one of @srowen @jkbradley @mengxr to give this a pass through too before merging.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52843218
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    @ygcao I think you're mixing up an empty seq/iterator with a seq/iterator over one "empty" element (like an empty string). We are talking about the former, but your example above is the latter.
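
    For reference, the distinction in a quick Scala check (illustrative values only):
    
        Seq.empty[String].grouped(3).toList // List() -- truly empty: flatMap emits nothing
        Seq("").grouped(3).toList           // List(List("")) -- one element that happens to be empty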




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51088866
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -556,6 +571,7 @@ class Word2VecModel private[spark] (
           .sortBy(- _._2)
           .take(num + 1)
           .tail
    +      .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm)))
    --- End diff --
    
    @ygcao please change this line to `.map { case (word, cosVec) => (word, if (vecNorm == 0.0) 0.0 else (cosVec / vecNorm)) }` as per discussion above.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-180820174
  
    **[Test build #50873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50873/consoleFull)** for PR 10152 at commit [`443ec06`](https://github.com/apache/spark/commit/443ec06d8e6381a9df8f69b89fcd4955095e6b6c).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-172778827
  
    **[Test build #2407 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2407/consoleFull)** for PR 10152 at commit [`141d7a2`](https://github.com/apache/spark/commit/141d7a2bff8c8e462a9a2aff33b1f16bebc30b25).




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51361374
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    Oops, I keep imagining there is a flatMapPartitions; it's mapPartitions plus flatMap.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r50671686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -556,6 +571,7 @@ class Word2VecModel private[spark] (
           .sortBy(- _._2)
           .take(num + 1)
           .tail
    +      .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm)))
    --- End diff --
    
    @srowen if either of the vector norms is 0, the cosine similarity should be 0. Though by definition I think `v._2` will be 0 here if the norm is 0 - what do you think about making this `if (vecNorm == 0) 0 ...` to be explicit?
    
    (This follows what you did in https://github.com/apache/spark/commit/94b39f7777ecff3794727c186bd681fa4c6af4fd#diff-88f4b62c382b26ef8e856b23f5167ccdR542.)




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-174446659
  
    @ygcao this has not been merged. You can see the PR is open and there is no message about merging into master.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r48324725
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * Sets the maximum length of each sentence in the input data.
    +   * Any sentence longer than this threshold will be divided into chunks of
    +   * up to `maxSentenceLength` size (default: 1000)
    +   */
    +  @Since("1.6.1")
    --- End diff --
    
    Done!


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47435369
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -264,7 +276,8 @@ class Word2Vec extends Serializable with Logging {
     
       /**
        * Computes the vector representation of each word in vocabulary.
    -   * @param dataset an RDD of words
    +   * @param dataset a RDD of sentences,
    +   *                each sentence is expressed as an iterable collection of words
    --- End diff --
    
    Good point. Will do


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47742736
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    Never mind; on closer inspection I found I had been confused by Scala syntax sugar.
    When I call getVectors("word"), it returns a vector after a couple of seconds, but it is implicitly doing two things: materializing the entire vocabulary map first and then looking up the word. The performance issue I found was therefore an illusion, since the call is genuinely doing a heavy job.
    I removed those unnecessary getters, which were designed to work around a non-existent problem.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52576421
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -272,15 +285,14 @@ class Word2Vec extends Serializable with Logging {
     
       /**
        * Computes the vector representation of each word in vocabulary.
    -   * @param dataset an RDD of words
    +   * @param dataset a RDD of sentences,
    --- End diff --
    
    we should use "an" here.



[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-180817180
  
    Jenkins, retest this please


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-163517944
  
    To see is to believe; comparison is the key. You are encouraged to build models from the same text/data set with both my version (using a simple sentence splitter on dots and question marks) and the old version, and then compare them to see the differences. BTW: if your data is not text, any sequence data still has a natural boundary just like a sentence; e.g. a user session's natural boundary is the time span of continuous operations.
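    For concreteness, a sketch of such a splitter feeding Word2Vec (assumes a `SparkContext` named `sc` and an illustrative file path; the regex and whitespace tokenization are deliberately naive):

    ```
    val sentences = sc.textFile("corpus.txt")
      .flatMap(_.split("[.?]"))      // naive boundaries: dot and question mark
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split("\\s+").toSeq)    // Word2Vec.fit expects an RDD of word collections
    ```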


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52722338
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    That is because `"".split(" ") = Array("")`, which has nothing to do with `grouped`.
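    Both facts are easy to verify in the REPL:

    ```
    "".split(" ")                                      // Array(""): one empty token, not an empty array
    "".split(" ").flatMap(Map.empty[String, Int].get)  // Array(): the vocab lookup then drops it
    ```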


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46912131
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
    --- End diff --
    
    As far as I can tell, there isn't a need for us to limit sentence length in our implementation; I don't see anything which will blow up without the limit.  It looks like gensim uses a limit in order to collect a buffer/batch of sentences.  Also, it looks like gensim doesn't throw out the rest of the sentence, unlike the MLlib implementation: [https://github.com/piskvorky/gensim/blob/839513f81e3aa42f490331fa80a28d13b7b7026f/gensim/models/word2vec.py#L1548]
    
    This is arguably a bug in MLlib, so I'd prefer to remove the limit by default, but provide a Param for setting it to a particular value to emulate previous behavior.
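    A REPL-sized sketch of the difference (the lengths are illustrative):

    ```
    val maxLen = 1000
    val words = (1 to 2500).toArray
    val clipped = words.take(maxLen)            // current MLlib behavior: 1500 words discarded
    val chunked = words.grouped(maxLen).toList  // gensim-style chunking: sizes 1000, 1000, 500
    ```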


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-185685961
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51483/
    Test PASSed.


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47429085
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +577,15 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm) {
    --- End diff --
    
    Cosine similarity is by definition normalized. I view the current implementation, which doesn't normalize correctly, as a bug. Also, there is no real performance gain from not normalizing, since only `fVector` is currently unnormalized, so that is one additional operation. What is the use case for a "non-normalized almost cosine" similarity vs standard cosine similarity?
    
    @jkbradley @srowen what are your thoughts on this? There is an outstanding JIRA [SPARK-7617](https://issues.apache.org/jira/browse/SPARK-7617) and PR #6245, addressing this issue.
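    For reference, a plain-Scala sketch of the definition being appealed to (the class's removed helper used BLAS; this version is only for reading):

    ```
    def cosine(v1: Array[Float], v2: Array[Float]): Double = {
      require(v1.length == v2.length, "Vectors should have the same length")
      val dot = v1.zip(v2).map { case (a, b) => a.toDouble * b }.sum
      val norm1 = math.sqrt(v1.map(x => x.toDouble * x).sum)
      val norm2 = math.sqrt(v2.map(x => x.toDouble * x).sum)
      if (norm1 == 0 || norm2 == 0) 0.0 else dot / (norm1 * norm2)
    }
    ```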


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-181253115
  
    Jenkins, retest this please


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-163587959
  
    Ok, so after some more digging into the original Google code and mailing list, I was actually incorrect in my original reading of the Google impl - sorry for the confusion. It does in fact treat newlines (`\n`) as a sentence boundary, and additionally imposes the limit of 1000 for `MAX_SENTENCE_LENGTH`.
    
    This happens in L81-84, where `\n` is allocated a special sentence delimiter of `</s>`, which is later kept at position `0` in the vocab (see comment L148), and in turn results in breaking out of the sentence construction loop in L393.
    
    See also these two Google group posts which make it more clear that sentences are newline-delimited - https://groups.google.com/forum/#!searchin/word2vec-toolkit/line$20braks/word2vec-toolkit/2elkT3cOMqo/DL_CsF1p8H8J and https://groups.google.com/forum/#!topic/word2vec-toolkit/3LAooMdrCl0
    
    Given this, in fact our current implementation is not correct and using `words` instead of `dataset` in the `mapPartitions` block is a bug.
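    A toy illustration of why the flattening is a bug under that reading (data invented):

    ```
    val dataset = Seq(Seq("the", "cat", "sat"), Seq("stocks", "fell", "today"))
    val words = dataset.flatten
    // words == Seq("the", "cat", "sat", "stocks", "fell", "today"):
    // a context window around "sat" now spans two unrelated sentences.
    ```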


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47429267
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    Sure. Let's say you want to use your trained vectors to map WordNet into vectors; both are big vocabularies without a one-to-one relationship, so you have to join them to convert only the trained words in WordNet into vectors. Calling getVectors for each word in WordNet could cost you days, while a join solves it in minutes.
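    A minimal sketch of that join (`sc`, `model`, and the WordNet pairs are hypothetical stand-ins; `getVectors` does return a `Map[String, Array[Float]]`):

    ```
    // Convert every trained WordNet word in one distributed join
    // instead of one per-word lookup.
    val wordnet = sc.parallelize(Seq(("dog", "n.02084071"), ("cat", "n.02121620")))
    val vectors = sc.parallelize(model.getVectors.toSeq)  // (word, vector) pairs
    val mapped  = wordnet.join(vectors)                   // keeps only words the model trained
    ```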


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r50961556
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    Good point; I'm not sure if it was for efficiency or simply to copy the C++ implementation. Hm, I don't suppose we have benchmarks here. I'm OK with keeping the longer, more efficient implementation, but now that input sentence boundaries matter, this can still be simplified. This part can still be used:
    
    ```
    // Each input sentence will produce 1 or more Array[Int], so flatMapPartitions
    dataset.flatMapPartitions { sentenceIter => 
      // Each sentence will map to 1 or more Array[Int], so map
      sentenceIter.map { sentence =>
        ...
      }
    }
    ```
    
    That is, you can still `flatMapPartitions` over `Iterator`s for each sentence to join them, without trying to manage both the `Iterator` over sentences and the `Iterator` over words.


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47767717
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,17 +295,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    +          // do cutting only when the sentence is larger than maxSentenceLength
    +          if ((wordIter == null || !wordIter.hasNext) && sentenceIter.hasNext) {
    +            do {
    --- End diff --
    
    is the `do ... while` block strictly necessary? Have you not already checked the preconditions for the while block in the if statement?
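    The question in miniature (a toy counter standing in for the sentence loop):

    ```
    var i = 0
    if (i < 3) {                   // the outer if already establishes the guard...
      do { i += 1 } while (i < 3)
    }
    var j = 0
    while (j < 3) { j += 1 }       // ...so a plain while behaves identically here
    ```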


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52141876
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -551,12 +553,17 @@ class Word2VecModel private[spark] (
           }
           ind += 1
         }
    -    wordList.zip(cosVec)
    +    var topResults = wordList.zip(cosVec)
           .toSeq
    -      .sortBy(- _._2)
    +      .sortBy(-_._2)
           .take(num + 1)
           .tail
    -      .toArray
    +    if (vecNorm != 0.0f) {
    +      topResults = topResults.map {
    +        case (word: String, cosVec: Double) => (word, cosVec / vecNorm)
    --- End diff --
    
    same here for style, prefer 
    ```
    topResults = topResults.map { case (word: String, cosVec: Double) =>
      ...
    ```
    
    (as long as it fits within line length restriction)


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46768439
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -555,6 +614,14 @@ class Word2VecModel private[spark] (
           (word, wordVectors.slice(vectorSize * ind, vectorSize * ind + vectorSize))
         }
       }
    +
    +  /**
    +   * get word vector array. length is vocabularySize*vectorSize
    +   * @return the array of word vectors, need to split it by vectorSize to get each individual vector
    +   */
    +  def getWordVectors: Array[Float] = {
    --- End diff --
    
    Yes, let's remove other changes that are not related to this issue. I don't think this class should expose its internal data structures and utility functions "just because". Just the sentence-length change should be here.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-172777566
  
    The PySpark test case is fixed.


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47542677
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,19 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * set the maxSentenceLength for cutting purpose
    --- End diff --
    
    @ygcao could you update the comment style here to match the parameter setters in this class?


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-165015765
  
    I found that applying normalization after trimming to the top results is actually straightforward, so I did it here, since the other pull request is stale; we can still use that pull request for further optimization.
    I also adjusted the comments to include the default value, as advised, and passed the lint check again.
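    In sketch form (values illustrative), "normalization after trimming" means dividing only the already-trimmed top results by the query vector's norm:

    ```
    val vecNorm = 2.0
    val top = Seq(("a", 1.8), ("b", 1.6))  // already sorted and cut down to the top k
    val normalized =
      if (vecNorm == 0.0) top
      else top.map { case (w, s) => (w, s / vecNorm) }
    ```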


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162253645
  
    Hi Sean, thanks for the comment. Sorry for not noticing the title requirement for the pull request; I did check the guideline, but didn't get time to check each line of it.
    As to the style, I'm pretty much following the existing code; could you be more specific about what you found that violated the style guide?
    As to the distance function, agreed that separation is preferred. Since cosineSimilarity is already in place in the class, I added more in place just to minimize the footprint of my changes. If you can point out an existing class for placing those functions, that would be great. I don't mind adding a new class just for distance functions, but that means bigger changes.


[GitHub] spark pull request: add support of arbitrary length sentence by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162118957
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-185659555
  
    Jenkins, retest this please


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52530874
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -551,12 +551,17 @@ class Word2VecModel private[spark] (
           }
           ind += 1
         }
    -    wordList.zip(cosVec)
    +    var topResults = wordList.zip(cosVec)
           .toSeq
    -      .sortBy(- _._2)
    +      .sortBy(-_._2)
           .take(num + 1)
           .tail
    -      .toArray
    +    if (vecNorm != 0.0f) {
    +      topResults = topResults.map { case (word, cosVec) =>
    --- End diff --
    
    `cosVec` shadows the `cosVec` outside. We can rename it to `cos` since it is not a vector.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r50676720
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -556,6 +571,7 @@ class Word2VecModel private[spark] (
           .sortBy(- _._2)
           .take(num + 1)
           .tail
    +      .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm)))
    --- End diff --
    
    I agree that we should define cosine similarity with a zero vector to be 0. In this case the results are pretty meaningless anyway, since the dot product was already 0 for everything, and so the top N are random. I'd say:
    
    ```
    .map { case (word, cosVec) => (word, if (vecNorm == 0.0) 0.0 else (cosVec / vecNorm)) }
    ```
    For more efficiency we could apply this map only when `vecNorm` is not 0.0, skipping the corner case; not sure if it's simpler, but it's a little less work:
    
    ```
    var result = wordList....tail
    if (vecNorm != 0.0) {
      result = result.map { case (word, cosVec) => (word, cosVec / vecNorm) }
    }
    result.toArray
    ```


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47739429
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +577,15 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm) {
    --- End diff --
    
    OK, let me pull back the normalization changes then; we can use the other pull request to follow up on normalization.
    As @srowen mentioned, normalization can also be done after getting the top K in the application.


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47435244
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +577,15 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm) {
    --- End diff --
    
    I don't mind much about making it always normalized. Just FYI: for the current brute-force kNN implementation in findSynonyms, the unnormalized version does save potentially millions of division operations when the vocabulary is in the millions, though that amounts to at most seconds saved. When callers only care about the top K without needing the metric, and call it for all the words they are interested in, that minor time saving is multiplied again.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-182237525
  
    It's getting down to personal taste now, but I adopted the suggestion anyway. Personally, I would always let the machine do the formatting and enforce the length limits (even adding braces for if statements, when we want to make that a rule). If we don't like the machine's default style, we can create a template for the Spark project so the machine does what the majority of the Spark community wants (supporting Eclipse is enough; IntelliJ and others can adopt an Eclipse formatter template). The key is that the machine should be the one doing the 'stupid', repetitive work for us ;)
    It seems we could create an issue for Spark about this: provide an IDE formatter template for Spark contributors.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-182591988
  
    @srowen Thanks! I will make a quick pass.


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-165093398
  
    @ygcao thanks for making all the changes; this PR is a lot clearer now, focusing in on the core issue. I think you can also link https://issues.apache.org/jira/browse/SPARK-7617 to this, as your latest changes fix that too. Could you also perhaps update your JIRA description to reflect the new scope of changes in this PR?
    
    Pinging @jkbradley @mengxr @MechCoder - can you just double check that you're happy with this - i.e. the core issue, as I mentioned previously, is that the current implementation is incorrect and discards all sentence boundaries, simply clipping "sentences" every 1000 words. This does not match the original Google impl (as well as others, e.g. Gensim and DL4J). These changes now respect sentence boundaries in the input RDD and impose a max sentence length (defaulting to the previous hard-coded `1000`).
    
    Can we also whitelist this PR to run the tests - I'm interested to see whether it impacts the Word2Vec tests in any way.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-170854032
  
    @ygcao sorry for the delay, I was on vacation and not checking email regularly. I'm pretty much happy with this, but I'd like to just get one of @srowen @jkbradley @mengxr to give it a final review.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-172480884
  
    **[Test build #2394 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2394/consoleFull)** for PR 10152 at commit [`76e8266`](https://github.com/apache/spark/commit/76e82667bd04bc359a336c7538d2595aade8384e).


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52575653
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -272,15 +285,14 @@ class Word2Vec extends Serializable with Logging {
     
       /**
        * Computes the vector representation of each word in vocabulary.
    -   * @param dataset an RDD of words
    +   * @param dataset a RDD of sentences,
    --- End diff --
    
    That's right, though RDD effectively starts with a vowel sound: arr-dee-dee. A native speaker would certainly say "an RDD" like "an hour". In a similar way, people disagree over "a SQL database" vs "an SQL database", but it's really a disagreement over whether you say "a _sequel_ database" or "an _ess-cyoo-ell_ database".


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-180188955
  
    Cool, thanks. It will be helpful to see the large-scale test results.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-180820036
  
    **[Test build #50873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50873/consoleFull)** for PR 10152 at commit [`443ec06`](https://github.com/apache/spark/commit/443ec06d8e6381a9df8f69b89fcd4955095e6b6c).


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52722467
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    +            // break wordIndexes into trunks of maxSentenceLength when has more
    +            val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
    +            sentenceSplit.map(_.toArray)
    --- End diff --
    
    `sentenceSplit` should be an `Iterator[Array[Int]]`. So this line might be unnecessary.
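    Checked in isolation, with the types spelled out:

    ```
    val wordIndexes: Array[Int] = Array(1, 2, 3, 4, 5)
    val sentenceSplit: Iterator[Array[Int]] = wordIndexes.grouped(2)  // already chunks of Array[Int]
    ```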


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46806131
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
           new Iterator[Array[Int]] {
             def hasNext: Boolean = iter.hasNext
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    +          //do translation of each word into its index in the vocabulary, not constraint by fixed length anymore
    +          for (wd <- iter.next()) {
    --- End diff --
    
    If making `maxSentenceLength` a configurable parameter, this doesn't need to change - and the `while` loop is preferable.


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52590418
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -76,6 +76,18 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * Sets the maximum length of each sentence in the input data.
    +   * Any sentence longer than this threshold will be divided into chunks of
    +   * up to `maxSentenceLength` size (default: 1000)
    +   */
    +  @Since("2.0.0")
    +  def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
    --- End diff --
    
    The param name comes from the original Google implementation. Either option (or both) works, but I guess I'd be marginally more in favour of amending the first line of the doc to read `... maximum length (in words) of each ...`, or something similar.
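    So the amended setter doc might read (wording per the suggestion above, otherwise unchanged from the quoted diff):

    ```
    /**
     * Sets the maximum length (in words) of each sentence in the input data.
     * Any sentence longer than this threshold will be divided into chunks of
     * up to `maxSentenceLength` size (default: 1000).
     */
    ```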


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-173607998
  
    **[Test build #2431 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2431/consoleFull)** for PR 10152 at commit [`e938208`](https://github.com/apache/spark/commit/e938208d9c85515f62b41635a8445b8ab31f55f2).


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162750381
  
    @MLnick 
    
    > So let's focus this PR in on making the max sentence size configurable, if this is desirable?
    
    +1
    
    > It does seem a little strange to me thinking about it to discard sentence boundaries.
    
    How are they being discarded?  Each row in the data given to MLlib is treated as a separate sentence; i.e., we aren't trying to model similarity across sentences (as far as I can tell glancing at it).  If people want to work with a unit other than sentences, then each unit can be on a separate row.  Am I misunderstanding?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52975344
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,19 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    -          }
    -          sentence.result()
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence =>
    +        // Sentence of words, some of which map to a word index
    +        val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +        if (wordIndexes.nonEmpty) {
    --- End diff --
    
    @ygcao you have kept the if statement here, which I believe both @mengxr and @srowen have shown is not necessary.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-181271217
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50915/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162974386
  
    @MLnick Oops, you're right; we are completely ignoring sentence boundaries.  I didn't look carefully enough at the code.  I'll correct my comment above too.
    
    >  it's an open question whether this approach, or keeping sentence structure / boundaries, is better
    
    Same here; I haven't looked at the follow-up literature sufficiently to say.  If anyone watching has references, that'd be useful.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47429145
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -264,7 +276,8 @@ class Word2Vec extends Serializable with Logging {
     
       /**
        * Computes the vector representation of each word in vocabulary.
    -   * @param dataset an RDD of words
    +   * @param dataset a RDD of sentences,
    +   *                each sentence is expressed as an iterable collection of words
    --- End diff --
    
    I think we should remove L286 and change `learnVocab` to `learnVocab(dataset)`. The `val words = dataset.flatMap(x => x)` can then be moved into `learnVocab`. This seems clearer to me, to avoid confusion, since `words` is no longer used apart from the learn-vocab step.
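    
    A minimal sketch of the shape I have in mind (a sketch only, assuming the surrounding `Word2Vec` members and the usual `org.apache.spark.rdd.RDD` import):
    
    ```scala
    // learnVocab takes the sentence RDD and flattens it internally,
    // so `words` no longer leaks out into fit()
    private def learnVocab[S <: Iterable[String]](dataset: RDD[S]): Unit = {
      val words = dataset.flatMap(x => x) // flatten only for vocab counting
      // ... existing vocabulary-building logic over `words` ...
    }
    ```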




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47547211
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +577,15 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm) {
    --- End diff --
    
    I think we should make it return proper cosine similarity. @jkbradley thoughts? 
    
    Note that there is an outstanding JIRA SPARK-7617 and PR #6245 that addresses the normalization, as well as caching the normalized vectors. Though it is a bit stale now - if @ezli will be updating that PR soon we can make those changes there, otherwise leave them for a follow-up PR?
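    
    For reference, "proper" cosine similarity would divide by both norms - a minimal sketch in plain Scala, not the BLAS-based code in this file:
    
    ```scala
    def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
      require(v1.length == v2.length, "Vectors should have the same length")
      val dot = v1.zip(v2).map { case (a, b) => a.toDouble * b }.sum
      val norm1 = math.sqrt(v1.map(x => x.toDouble * x).sum)
      val norm2 = math.sqrt(v2.map(x => x.toDouble * x).sum)
      // degenerate vectors get similarity 0 rather than NaN
      if (norm1 == 0.0 || norm2 == 0.0) 0.0 else dot / (norm1 * norm2)
    }
    ```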




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46992991
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,7 +469,32 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    +  /**
    +   * get cosineSimilarity of two word. assumed to be from the vocabulary and used after model built, otherwise will error out
    +   * @param v1 one word from the vocabulary
    +   * @param v2 the other word from the vocabulary
    +   * @return the cosinesimilarity score in the vector space of the given two words
    +   */
    +  def cosineSimilarity(v1: String, v2: String): Double = {
    +    return cosineSimilarity(getVectors(v1), getVectors(v2))
    +  }
    +
    +  /**
    +   * get the built vocabulary from the input
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary:Map[String,Int]={
    --- End diff --
    
    Regardless, as others have said, these extra methods should come in a separate PR.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52530860
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -272,15 +285,14 @@ class Word2Vec extends Serializable with Logging {
     
       /**
        * Computes the vector representation of each word in vocabulary.
    -   * @param dataset an RDD of words
    +   * @param dataset a RDD of sentences,
    --- End diff --
    
    `a RDD` -> `an RDD`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-187313458
  
    Thanks everybody for the review and help! Cheers!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52867732
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    Agreed - @ygcao if we can make this final change I think we should be good to go


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47428935
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    It's still not clear to me why we need to expose `getVocabulary` and `getWordVectors`. You mentioned "joins", but could you perhaps provide a concrete use case and example of why these are required?




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162526464
  
    Pinging @mengxr @MechCoder @jkbradley (and I think @Ishiihara was the original author of Word2Vec?)
    
    So let's focus this PR in on making the max sentence size configurable, if this is desirable?
    
    Looking a bit deeper, the sentence structure of the input is essentially discarded in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L273. This dates back to the original implementation, and it does match the original Google implementation that treats end-of-line as a word boundary, then imposes a `MAX_SENTENCE_LENGTH` of 1000 when processing the word stream.
    
    It's interesting to note that e.g. Gensim's implementation respects the sentence structure of the input data (https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L120). Deeplearning4j seems to do the same.
    
    It does seem a little strange to me, thinking about it, to discard sentence boundaries. It makes sense for very large text corpora. But Word2Vec is more general than that, and can be applied e.g. in recommendation settings, where the boundary between "sentences" - say, between "user activity histories" - is more patently "discontinuous".
    
    Thoughts? On the face of it we can leave the implementation as is (as it is true to the original), optionally making the max sentence length a configurable param. Or we can look at using the "sentence" structure of the input data (perhaps making the behaviour configurable between this and the original impl).




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r48324747
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,18 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * Sets the maximum length of each sentence in the input data.
    +   * Any sentence longer than this threshold will be divided into chunks of
    +   * up to `maxSentenceLength` size (default: 1000)
    +   */
    +  @Since("1.6.1")
    --- End diff --
    
    Done!




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47876214
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
    +   * into chunks when it is too long. (default: 1000)
    +   * @param maxSentenceLength the maxSentenceLength allowed.
    +   *                          recommend to set it large enough to respect reasonable long sentences
    +   *                          while not overflow memory
    +   * @return this object
    +   */
    +  def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
    --- End diff --
    
    I think we should target `1.6.1` here: `@Since("1.6.1")` - overall I'd view this PR as a bugfix (though adding the parameter is a minor extra feature). I think we'd want to include this in branch-1.6, and possibly even think about backporting the core changes to branch-1.5




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52708705
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    Sorry, I'm still not quite sure about this. I did a test, and it turns out I am right :grinning: 
    scala> val sentences=List("test sen 1","","testsen 2")
    sentences: List[String] = List(test sen 1, "", testsen 2)
    
    scala> val rdd=sc.parallelize(sentences)
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
    
    scala> val results=rdd.flatMap(sen=>sen.split(" ").grouped(1))
    results: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at flatMap at <console>:25
    
    scala> results.collect
    res0: Array[Array[String]] = Array(Array(test), Array(sen), Array(1), **Array("")**, Array(testsen), Array(2))
    
    If we don't have the if statement, we'll end up with empty arrays, which could cause trouble for the following steps. I'd like to be on the safe side; the if statement is cheap enough.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47765417
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
    --- End diff --
    
    Sorry to be pedantic, but can we capitalize the `sets` and make the comment something like:
    
    ```
    Sets the maximum length of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to `maxSentenceLength` size (default: 1000)
    ```




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52142120
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,26 +301,24 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions {
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter =>
    --- End diff --
    
    This could even be `dataset.mapPartitions { _.flatMap { sentence =>`, which I kind of like, but I don't know how much it matters.
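    
    Concretely, something like this sketch (reusing `bcVocabHash` and `maxSentenceLength` from the diff above):
    
    ```scala
    val sentences: RDD[Array[Int]] = dataset.mapPartitions { _.flatMap { sentence =>
      // keep only in-vocabulary words, then chunk overly long sentences
      val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
      wordIndexes.grouped(maxSentenceLength).map(_.toArray)
    }}
    ```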




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-185669428
  
    **[Test build #51483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51483/consoleFull)** for PR 10152 at commit [`a4abd40`](https://github.com/apache/spark/commit/a4abd40f5a553fa95795e9da5fff0b4ac0256188).




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46925097
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
    --- End diff --
    
    Agreed. Will do.
    Btw: the usage of maxSentenceLength in the Python code is different from the original MLlib Word2Vec: it actually respects sentence structure on condition that each sentence is shorter than the max allowance, which is a mixture of my change with an answer to @srowen's concern about overly long sentences. But as you mentioned and I mentioned below, overly long sentences should not be a concern in normal situations. The only exception I can think of is using a poor sentence splitter plus feeding a concatenated large portion of a big corpus as one individual document of the input.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46806041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
    --- End diff --
    
    Agree that Word2Vec can be used in settings where the "sentence" is just a sequence of "things" (e.g. recommendation where the "sentences" could be "page views" for a user, and could be longer than 1000).
    
    So while I'd tend to agree that the caller should really be responsible for setting up the "sentence" structure, as @srowen says, rather make it configurable with a default of `1000` to keep backward-compatible behavior. You can create a `set` method for the new parameter and add a `@Since` annotation (see e.g. the `setMinCount` method that was added), along the lines of the sketch below.
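    
    A sketch of such a setter, mirroring `setMinCount` (the exact `@Since` version and the sanity check are placeholders):
    
    ```scala
    @Since("2.0.0")
    def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
      require(maxSentenceLength > 0, "maxSentenceLength must be positive")
      this.maxSentenceLength = maxSentenceLength
      this
    }
    ```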




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-185685958
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52574384
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -551,12 +551,17 @@ class Word2VecModel private[spark] (
           }
           ind += 1
         }
    -    wordList.zip(cosVec)
    +    var topResults = wordList.zip(cosVec)
           .toSeq
    -      .sortBy(- _._2)
    +      .sortBy(-_._2)
           .take(num + 1)
           .tail
    -      .toArray
    +    if (vecNorm != 0.0f) {
    +      topResults = topResults.map { case (word, cosVec) =>
    --- End diff --
    
    Good point!




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47545680
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    @ygcao I'm still not certain what the benefit of exposing these is. `getVectors` returns the map of vectors for all words in the vocab, so you only need to call it once. Then you could either broadcast it or parallelize it to join it up with a set of words. How is that different from calling a combination of `getVocabulary` and `getWordVectors`?
    
    If the performance of `getVectors` can be substantially sped up then that is something we should do (in which case it would be great to have some relative timings for comparison for different vocabulary size and vector dimension).
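    
    For example, a broadcast join could look roughly like this (hypothetical names: `model` is a fitted `Word2VecModel`, `tokens` an `RDD[String]`):
    
    ```scala
    val vecs: Map[String, Array[Float]] = model.getVectors // called once
    val bcVecs = sc.broadcast(vecs)
    // attach a vector to every token that is in the vocabulary
    val tokenVectors = tokens.flatMap(w => bcVecs.value.get(w).map(v => (w, v)))
    ```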




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-173094302
  
    Adjusted the Python doctest format, and did the heavy job of running the pyspark tests locally. It should be OK now.
    Finished test(python): pyspark.ml.feature (35s)




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52835896
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    They should be equivalent Scala code. We don't need to mix in their meanings. Let us try providing an example of  `sentenceIter: Iterator[Array[String]]` and `vocabHash: Map[String, Int]` such that
    
    ~~~scala     
        sentenceIter.flatMap { sentence =>
            val wordIndexes = sentence.flatMap(vocabHash.get)
            wordIndexes.grouped(maxSentenceLength).map(_.toArray)
        }
    ~~~
    
    returns a different result from your code:
    
    ~~~scala
          sentenceIter.flatMap { sentence =>
            val wordIndexes = sentence.flatMap(vocabHash.get)
            if (wordIndexes.nonEmpty) {
              val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
              sentenceSplit.map(_.toArray)
            } else {
              None
            }
         }
    ~~~
    
    Essentially we are comparing the behavior when `wordIndexes` is empty. In this case, no matter what value `maxSentenceLength` takes, `wordIndexes.grouped(maxSentenceLength).map(_.toArray)` returns an empty iterator, which will be skipped by `sentenceIter.flatMap`. So it is the same as `None` being skipped by `sentenceIter.flatMap` in the current implementation.
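    
    A quick check with toy data (hypothetical vocab and sentences) illustrates this:
    
    ~~~scala
    val vocabHash = Map("a" -> 1, "c" -> 2)
    val sentenceIter = Iterator(Array("a", "b"), Array("b"), Array("c", "a"))
    val result = sentenceIter.flatMap { sentence =>
      sentence.flatMap(vocabHash.get).grouped(2).map(_.toArray)
    }
    // the all-out-of-vocab sentence Array("b") contributes nothing:
    result.map(_.toList).toList // List(List(1), List(2, 1))
    ~~~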




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-164112022
  
    Updated the code:
    1. passed the lint-scala check by adding spaces and cutting comment line lengths
    2. removed the distance functions as we discussed, including the existing zombie one (not used and not accessible from outside)
    3. made the changes to add back maxSentenceSize as a configurable variable while still respecting sentence boundaries by default, just to meet the demands of people who construct sentences in unimaginable ways. I'll personally always set it large enough (up to document size) to never affect natural sentence boundaries. Anyway, it's good to have the option for different people's needs.





[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51356725
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    Sorry, I can't do much about the spark-perf thing; I still haven't found time to figure out what's blocking me from running the whole test suite locally.
    I (and my compiler) am also not aware of the existence of a flatMapPartitions function, but I did make a version that should do whatever Sean suggested. Please review, and help to run the perf-test on the two latest versions.
    BTW: I don't worry much about the perf difference, since it shouldn't be much; even minutes (not quite possible) of difference per partition would just mean possibly tens of minutes of overall penalty, which shouldn't matter much for a job that runs for hours. Of course, seeing the data is more convincing. My point is that if the diff is minor, we'd better optimize for readability.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-166544099
  
    Modified the comment accordingly.




[GitHub] spark pull request: add support of arbitrary length sentence by us...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162206668
  
    @ygcao back up a moment and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You need to update this PR and connect it to the JIRA. You have some code style problems here, and I'm not clear that some of the ancillary changes make sense. This class is not intended to provide a Euclidean distance or cosine measure.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r53122840
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    You guys are right; I didn't quite get @mengxr's previous explanation. The splitting function does something unexpected to me: splitting an empty String results in a non-empty array.
    FYI, I verified the logic with the following code as a more explicit test, which proved your assertions.
    scala> sentences
    res4: List[List[String]] = List(List(a, b, c), List(b), List(c, d))
    
    scala> dict
    res5: scala.collection.immutable.Map[String,Int] = Map(a -> 1, c -> 2)
    
    scala> sentences.flatMap(sen=>{val indexes=sen.flatMap(dict.get);indexes.grouped(2).map(_.toArray)})
    res6: List[Array[Int]] = List(Array(1, 2), Array(2))
    
    scala> "".split(" ")
    res7: Array[String] = Array("")
    
    scala> "".split(" ").size
    res8: Int = 1
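    
    (For what it's worth, if the empty-token artifact ever matters upstream, a filter avoids it - a tiny sketch:)
    
    ```scala
    "".split(" ").filter(_.nonEmpty)     // Array()
    "a  b".split(" ").filter(_.nonEmpty) // Array(a, b)
    ```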





[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46922797
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
           new Iterator[Array[Int]] {
             def hasNext: Boolean = iter.hasNext
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    +          //do translation of each word into its index in the vocabulary, not constraint by fixed length anymore
    +          for (wd <- iter.next()) {
    --- End diff --
    
    I think my change is getting some understanding now. The problem is not so much about having a maxSentenceLength; the problem with the original version is that it discards natural sentences by flatMapping them and then cutting the document into chunks of maxSentenceLength (except for the last one, which could be shorter), so no sentence boundaries are respected as a result.
    This change is designed to use the sentence structure of the input. As to the max sentence length limitation, I don't really see a need for it: the worst case is that the entire document is one sentence, and it will still work, since each document is small enough to fit in memory and the skip-gram model is built around a size-constrained sliding window rather than the sentence or document. Respecting sentences avoids the sliding window taking context words for a target word from adjacent sentences, which could be irrelevant.
    Agreed that people may use word2vec in very different scenarios, so we can offer both logics as choices. I'll make it an option rather than argue which one is absolutely correct, although I don't see a good justification for using fixed-size chunks instead of the structure.
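    
    To illustrate the sliding-window point with a hypothetical helper (not code from this PR): when sentences stay separate arrays, the window can never pair a word with context from a neighbouring sentence.
    
    ```scala
    // all (word, context) pairs for a skip-gram window over one sentence;
    // indices never leave the sentence, so no cross-sentence contexts arise
    def contexts(sentence: Array[Int], window: Int): Seq[(Int, Int)] =
      sentence.indices.flatMap { i =>
        val lo = math.max(0, i - window)
        val hi = math.min(sentence.length - 1, i + window)
        (lo to hi).filter(_ != i).map(j => (sentence(i), sentence(j)))
      }
    ```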




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52823895
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    +            // break wordIndexes into trunks of maxSentenceLength when has more
    +            val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
    +            sentenceSplit.map(_.toArray)
    --- End diff --
    
    Sorry again, we can't make that change: the compiler will complain that the Iterator can't be used for flatMap there, since it expects a GenTraversableOnce.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-170885940
  
    **[Test build #2372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2372/consoleFull)** for PR 10152 at commit [`214d0d9`](https://github.com/apache/spark/commit/214d0d9d3350a624f98f36aff2393b7b61fd6176).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: add support of arbitrary length sentence by us...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46759159
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -520,7 +537,7 @@ class Word2VecModel private[spark] (
        * @return array of (word, cosineSimilarity)
        */
       @Since("1.1.0")
    -  def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
    +  def findSynonyms(vector: Vector, num: Int, norm: Boolean): Array[(String, Double)] = {
    --- End diff --
    
    You can't change the API. This also isn't documented, or motivated.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47429372
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    Another thing I want to bring to your attention: the slice function in Scala (used in getVectors) is surprisingly slow. For join purposes, producing the needed vectors from getWordVectors' return value by index is about 100x faster than calling getVectors for each single word.
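    A sketch of the index-based access pattern being described (the variable shapes are assumptions for illustration: a packed Float array of concatenated vectors plus a word-to-index map):

    ```scala
    // Sketch only: read a word's vector out of packed storage via its index,
    // with one bulk copy instead of a fresh slice per lookup.
    def vectorFor(
        word: String,
        wordIndex: Map[String, Int],
        wordVectors: Array[Float],
        vectorSize: Int): Option[Array[Float]] = {
      wordIndex.get(word).map { i =>
        val out = new Array[Float](vectorSize)
        System.arraycopy(wordVectors, i * vectorSize, out, 0, vectorSize)
        out
      }
    }
    ```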
    





[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46763785
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -520,7 +537,7 @@ class Word2VecModel private[spark] (
        * @return array of (word, cosineSimilarity)
        */
       @Since("1.1.0")
    -  def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
    +  def findSynonyms(vector: Vector, num: Int, norm: Boolean): Array[(String, Double)] = {
    --- End diff --
    
    Will add Javadoc to explain it.
    For backward compatibility, I can add a default value in this function instead of the newly introduced one.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51391185
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    I finally made up my mind to do a quick, hacky perf test as a proof of concept: across the 5x runs, the perf difference between the implementations is negligible, since it falls within the run-to-run variance of any single version.
    Some details:
    I prepared a 32k document from two arbitrarily picked Wikipedia pages ("machine learning" and "Adversarial machine learning", excluding the reference sections), which contains 341 lines and can be split into 442 sentences simply by using dot+space as the sentence boundary. I injected the following test case into the Word2VecSuite class and ran it against three different implementations of the fit function in the mllib.feature.Word2Vec class: the old one in the master branch, and my final two versions from before and after adopting Sean's suggestion.
    
    <pre>
    test("testSpeed") {
        val lines = sc.parallelize(Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
        val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
        println("read file into rdd, lines=", sentences.count())
        var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
        var duration = 0l
        for (i <- 1 to 5) {
          val start = System.currentTimeMillis()
          val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42l)
          builtModel = model.fit(sentences)
          duration += (System.currentTimeMillis() - start)
        }
        println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, learning's synonyms:${builtModel.findSynonyms("learning", 4).mkString("\n")}")
      }
    </pre>
    
     The vocabulary size from the model is 155. Here are the timings (in ms) for three runs of each version, plus the average of the final two runs. As you can see from the code, each run actually builds the model 5 times to magnify any potential difference.
    <pre>
                           masterVersion   PR-useIter   PR-useCollection
    run1                   2232            2107         1933
    run2                   2085            1986         1987
    run3                   2005            2123         2004
    average(run2, run3)    2045            2054.5       1995.5
    </pre>
    BTW, the following is not relevant to the perf test, just FYI: the two versions in this pull request produce exactly the same result, which supports the correctness of both. The result is interesting as well, although the dataset is quite tiny: the new versions (the un-merged ones) look better than the old version (the one in the master branch right now). Of course, you can make the new version do badly with a bad sentence splitter (we could make a hard-cut splitter do exactly the same thing as the old version); the simple splitter used for the test case can't deal with abbreviations, which is why I removed the references section from the text.
    Here are the top synonyms of "learning" on the tiny dataset. Please keep in mind this is just for fun, not solid proof of which is definitely better, since the dataset is tiny.
    New versions, learning's synonyms:
    (network,0.9990321742854605)
    (related,0.9966140511173031)
    (sparse,0.9965729586431097)
    (algorithms,0.99376379497485)
    
    Old version, learning's synonyms:
    (against,0.9895162633562077)
    (Support,0.9547255372896342)
    (Association,0.9499811242788365)
    (Attacks,0.9321700815006693)
    
    
    
    
    
    
    
    
    





[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47767629
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,17 +294,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    --- End diff --
    
    Ah right, yes you're correct. We should chunk up the sentences.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-177429666
  
    Adopted all of Sean's suggestions, with minor edits to address a missed edge case and make the compiler happy; I think these suggestions are good ones, thanks! Please trigger the QA build to see whether everything goes fine, and potentially run a perf test to see whether there are big differences.





[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-181257924
  
    **[Test build #50915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50915/consoleFull)** for PR 10152 at commit [`84a0bc4`](https://github.com/apache/spark/commit/84a0bc4c73ca0915f6e6291442c8e60f6d80d879).




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47430147
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +577,15 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm) {
    --- End diff --
    
    As the comment says, it's not normalized just because normalization doesn't affect the ordering. However, the API promises a "cosine similarity", and that's not what it returns. Since the normalization need only be applied to the result, I think it should be an actual cosine similarity. I do not think it should have a flag, no.
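    To illustrate the point, a minimal sketch in plain Scala (hypothetical helper names, not the model's actual code): dividing the raw score by both norms yields a true cosine similarity.

    ```scala
    // Sketch only: dot(q, w) / ||w|| preserves the result ordering for a fixed
    // query q, but dividing by ||q|| as well gives an actual cosine similarity
    // in [-1, 1].
    def dot(a: Array[Float], b: Array[Float]): Float =
      a.zip(b).map { case (x, y) => x * y }.sum

    def norm(a: Array[Float]): Float =
      math.sqrt(dot(a, a).toDouble).toFloat

    def cosineSimilarity(q: Array[Float], w: Array[Float]): Double = {
      val denom = norm(q).toDouble * norm(w)
      if (denom == 0.0) 0.0 else dot(q, w) / denom
    }
    ```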




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r50684006
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -556,6 +571,7 @@ class Word2VecModel private[spark] (
           .sortBy(- _._2)
           .take(num + 1)
           .tail
    +      .map(v => (if (vecNorm == 0) v else (v._1, v._2 / vecNorm)))
    --- End diff --
    
    Strictly speaking, we could throw an error or warning when a zero vector is passed in, since it's meaningless, as you say. Still, I think that's unlikely in real usage, so I'd just go for the first option. I don't think there are real-world efficiency gains to be had with the second option.
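    For reference, a sketch of the guard being discussed (illustrative only; the merged code may differ):

    ```scala
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Sketch of the zero-vector guard under discussion; illustrative only.
    def requireNonZero(vector: Vector): Unit = {
      require(Vectors.norm(vector, 2.0) > 0.0,
        "Query vector must not be all zeros: cosine similarity is undefined for it")
    }
    ```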




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52773740
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    +            // break wordIndexes into trunks of maxSentenceLength when has more
    +            val sentenceSplit = wordIndexes.grouped(maxSentenceLength)
    +            sentenceSplit.map(_.toArray)
    --- End diff --
    
    Good catch! I'll double check and change.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46811287
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +588,13 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm)
    +      vecNorm = blas.snrm2(vectorSize, fVector, 1)
    --- End diff --
    
    @ygcao this is actually a good catch but I believe is covered by [SPARK-7617](https://issues.apache.org/jira/browse/SPARK-7617) and #6245.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r48119953
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -77,6 +77,20 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * sets the maxSentenceLength, maxSentenceLength is used as the threshold for cutting sentence
    --- End diff --
    
    One final thing - can you address the comment above? And I think we can actually remove the `@param` and `@return` to match the comments for the other setters in this class.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-185685501
  
    **[Test build #51483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51483/consoleFull)** for PR 10152 at commit [`a4abd40`](https://github.com/apache/spark/commit/a4abd40f5a553fa95795e9da5fff0b4ac0256188).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46807883
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,7 +469,32 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    +  /**
    +   * get cosineSimilarity of two word. assumed to be from the vocabulary and used after model built, otherwise will error out
    +   * @param v1 one word from the vocabulary
    +   * @param v2 the other word from the vocabulary
    +   * @return the cosinesimilarity score in the vector space of the given two words
    +   */
    +  def cosineSimilarity(v1: String, v2: String): Double = {
    +    return cosineSimilarity(getVectors(v1), getVectors(v2))
    +  }
    +
    +  /**
    +   * get the built vocabulary from the input
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary:Map[String,Int]={
    --- End diff --
    
    @ygcao what is the specific need for `getVocabulary`?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-186546462
  
    At long last, I think it's ready to go. @mengxr any more comments? Or @MLnick?




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46768448
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
    --- End diff --
    
    I suspect the max length here is mostly because it was in the original implementation. The problem here is: what if the 'sentence' ends up being very very long?
    
    Although a real "sentence" would not ever reasonably be 1000+ words, word2vec is used in other contexts where a "sentence" is something else. I can see making this configurable, but not removing it.
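    A configurable limit could look like the following minimal sketch (simplified for illustration; the diff quoted elsewhere in this thread shows the variant the PR ends up adding, on a class with many more parameters):

    ```scala
    // Minimal sketch of a configurable sentence-length limit.
    class Word2VecSketch {
      private var maxSentenceLength = 1000 // default chunk size in words

      /**
       * Sets maxSentenceLength: any sentence longer than this threshold is
       * divided into chunks of up to maxSentenceLength words.
       */
      def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
        require(maxSentenceLength > 0, "maxSentenceLength must be positive")
        this.maxSentenceLength = maxSentenceLength
        this
      }
    }
    ```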




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52773573
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    The purpose of the if statement is that an empty sentence (after vocabulary lookup) should not result in an empty element. Whether we use grouped or not doesn't change the fact that it could generate empty elements, which could be harmful and wasteful for later steps.
    My test shows the if statement is needed to get rid of the empty elements.




[GitHub] spark pull request: add support of arbitrary length sentence by us...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46759178
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,16 +280,17 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    //each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
    --- End diff --
    
    I'm not sure it makes sense to remove the max entirely, no. What problem is it causing you? Maybe it could be configurable.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-170927597
  
    @ygcao the changes have impacted the tests - could you take a look at the failure? I think we may need to update the test suite. 




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46768436
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,7 +469,32 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    +  /**
    +   * get cosineSimilarity of two word. assumed to be from the vocabulary and used after model built, otherwise will error out
    +   * @param v1 one word from the vocabulary
    +   * @param v2 the other word from the vocabulary
    +   * @return the cosinesimilarity score in the vector space of the given two words
    +   */
    +  def cosineSimilarity(v1: String, v2: String): Double = {
    +    return cosineSimilarity(getVectors(v1), getVectors(v2))
    +  }
    +
    +  /**
    +   * get the built vocabulary from the input
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary:Map[String,Int]={
    --- End diff --
    
    Run `dev/lint-scala` to see errors. For instance, this is missing spaces. You don't need `return`




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-180820177
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50873/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-182595810
  
    made one pass and only minor comments




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r50675972
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    I understand that this part of the change intends to respect the implied sentence boundaries in the input. I think it can be simpler? One input sentence maps to 1 or more arrays, and the result should be flattened. Something like?
    
    ```
    // Each input sentence will produce 1 or more Array[Int], so flatMapPartitions
    dataset.flatMapPartitions { sentenceIter => 
      // Each sentence will map to 1 or more Array[Int], so map
      sentenceIter.map { sentence =>
        // Sentence of words, some of which map to a hash, so flatMap
        val hashes = sentence.flatMap(bcVocabHash.value.get)
        // break into sequence of at most maxSentenceLength
        hashes.grouped(maxSentenceLength).map(_.toArray)
      }
    }
    ```
    
    I haven't tested it but does that seem like the intent?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52575825
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    +          // Sentence of words, some of which map to a word index
    +          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    +          if (wordIndexes.nonEmpty) {
    --- End diff --
    
    Yes, `flatMap` would flatten an empty iterator to nothing.
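    A quick collections-level check of this point (a sketch, not the PR code):

    ```scala
    // Sketch: a sentence whose words all miss the vocabulary becomes an empty
    // sequence, whose grouped(...) iterator is empty, which flatMap drops.
    val indexed = Seq(Seq(1, 2, 3), Seq.empty[Int])
    val chunks = indexed.flatMap(s => s.grouped(2).map(_.toArray))
    // chunks contains Array(1, 2) and Array(3) only; the empty sentence
    // contributed no element at all.
    ```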




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47542276
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,17 +294,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    --- End diff --
    
    I think we can simply do:
    
    ```scala
        val sentences: RDD[Array[Int]] = dataset.mapPartitions { iter =>
          new Iterator[Array[Int]] {
            def hasNext: Boolean = iter.hasNext
    
            def next(): Array[Int] = {
              val sentence = ArrayBuilder.make[Int]
              var sentenceLength = 0
              val wordIter = iter.next().iterator
              while (wordIter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
                val word = bcVocabHash.value.get(wordIter.next())
                word match {
                  case Some(w) =>
                    sentence += w
                    sentenceLength += 1
                  case None =>
                }
              }
              sentence.result()
            }
          }
        }
    ```




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-166215472
  
    @ygcao just one final comment on the `setMaxSentenceLength` setter comment to address, thanks!




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52867596
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,24 +301,20 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    -            }
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter.flatMap { sentence => {
    --- End diff --
    
    Final (hopefully) style change - the second brace is not required (i.e. it should be `sentenceIter.flatMap { sentence =>`).




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-164184139
  
    Realized an issue and did a quick fix: sentence boundaries were not really respected in the previous version.





[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-185036851
  
    Done! Sorry for missing the comments.




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r46921737
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,7 +469,32 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    +  /**
    +   * get cosineSimilarity of two word. assumed to be from the vocabulary and used after model built, otherwise will error out
    +   * @param v1 one word from the vocabulary
    +   * @param v2 the other word from the vocabulary
    +   * @return the cosinesimilarity score in the vector space of the given two words
    +   */
    +  def cosineSimilarity(v1: String, v2: String): Double = {
    +    return cosineSimilarity(getVectors(v1), getVectors(v2))
    +  }
    +
    +  /**
    +   * get the built vocabulary from the input
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary:Map[String,Int]={
    --- End diff --
    
    getVocabulary and getWordVectors are useful when you need to join against or iterate over the built vectors in batch, while getVectors is only useful for looking up the vector of one specific known word in the vocabulary, and it throws an exception when the word is out of vocabulary. The usages are very different.
    Will look into the DataFrame version to see whether it can cover the batch use case, to decide whether we can work around this without adding a getter here.
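    A minimal sketch of the batch join use case described above (hypothetical names; assumes the `getVocabulary: Map[String, Int]` accessor proposed in this PR):

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch only: filter a corpus down to in-vocabulary words in one pass,
    // instead of calling a per-word lookup such as getVectors for each record.
    def filterToVocabulary(
        sc: SparkContext,
        corpus: RDD[String],
        vocabulary: Map[String, Int]): RDD[(String, Int)] = {
      val bcVocab = sc.broadcast(vocabulary) // ship the vocabulary once
      corpus.flatMap(word => bcVocab.value.get(word).map(idx => (word, idx)))
    }
    ```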




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52574236
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -272,15 +285,14 @@ class Word2Vec extends Serializable with Logging {
     
       /**
        * Computes the vector representation of each word in vocabulary.
    -   * @param dataset an RDD of words
    +   * @param dataset a RDD of sentences,
    --- End diff --
    
    This is an interesting topic: it seems 'r' is not a vowel, and it doesn't sound like a vowel either, so why 'an'?
    I found this on the web: you use the article "a" before words that start with a consonant sound and "an" before words that start with a vowel sound.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-168907161
  
    @MLnick Happy new year. I think I've addressed all your comments last year; could you help to do the merge this year? If anyone still has other concerns, please let me know. An unfinished pull request without a good reason would be a huge waste. Thanks!




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-184528879
  
    Addressed the 'final' comment, and checked lint and the test cases. Shall we do the merge then? Thanks!




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-172787971
  
    **[Test build #2407 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2407/consoleFull)** for PR 10152 at commit [`141d7a2`](https://github.com/apache/spark/commit/141d7a2bff8c8e462a9a2aff33b1f16bebc30b25).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10152




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-170878810
  
    **[Test build #2372 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2372/consoleFull)** for PR 10152 at commit [`214d0d9`](https://github.com/apache/spark/commit/214d0d9d3350a624f98f36aff2393b7b61fd6176).




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-162975357
  
    Overall, I'd say it's unclear whether we need to modify our implementation.  How about we look for use cases and see if people have reported differences between following and ignoring sentence boundaries, before continuing with lots of code changes?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-183197942
  
    Addressed the new comments. I still kept the if statement, as I explained with the sample code.
    Reran the tests and the lint check. Jenkins should still be happy :fireworks:




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47548255
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -534,8 +577,15 @@ class Word2VecModel private[spark] (
         // Need not divide with the norm of the given vector since it is constant.
         val cosVec = cosineVec.map(_.toDouble)
         var ind = 0
    +    var vecNorm = 1f
    +    if (norm) {
    --- End diff --
    
    You only need to normalize the top K values that are returned, at the end.
    If speed is important here, then this implementation should probably be changed to use a heap or something rather than needlessly sort the whole thing. That's a much bigger bottleneck.
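    
    For illustration, a minimal sketch of the heap-based top-K selection suggested above; the names (`cosVec`, `topK`) are hypothetical stand-ins, not the PR's actual code:
    
    ```scala
    import scala.collection.mutable
    
    // Hypothetical sketch: pick the k highest-scoring indices from cosVec
    // without sorting the whole array. A min-heap holds the current top k,
    // with the smallest of them on top so it can be evicted cheaply.
    def topK(cosVec: Array[Double], k: Int): Array[(Int, Double)] = {
      val minHeap = mutable.PriorityQueue.empty[(Int, Double)](
        Ordering.by[(Int, Double), Double](-_._2))
      var i = 0
      while (i < cosVec.length) {
        if (minHeap.size < k) {
          minHeap.enqueue((i, cosVec(i)))
        } else if (cosVec(i) > minHeap.head._2) {
          minHeap.dequeue()
          minHeap.enqueue((i, cosVec(i)))
        }
        i += 1
      }
      minHeap.dequeueAll.reverse.toArray // highest similarity first
    }
    ```
    
    Normalizing only the k returned scores afterwards is then O(k) work instead of O(vocabulary).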




[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47552249
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,17 +294,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
    --- End diff --
    
    This is very close to my original version, except that it will throw away all words after MAX_SENTENCE_LENGTH. Are you suggesting we make maxSentenceLength a static config?
    The latest version of mine will still make use of the rest of the sentence for training after cutting at maxSentenceLength. E.g., a 2200-word sentence will be used as three cut sentences, just like in the old version, except that the last/third sentence from the cut will be 200 words long, with no words padded in from the next sentence. This way, we maximize the use of our data while respecting both sentence boundaries and the sentence length restriction.
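    
    A minimal sketch of that chunking arithmetic, assuming a hypothetical maxSentenceLength of 1000 and a sentence already translated to vocabulary indices:
    
    ```scala
    // Assumed values for illustration only.
    val maxSentenceLength = 1000
    val sentenceIndices: Array[Int] = Array.tabulate(2200)(identity)
    
    // grouped() yields full-size chunks plus a shorter final remainder:
    // 1000 + 1000 + 200 here, with no words padded in from the next sentence.
    val chunks: Seq[Array[Int]] = sentenceIndices.grouped(maxSentenceLength).toSeq
    assert(chunks.map(_.length) == Seq(1000, 1000, 200))
    ```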




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by ygcao <gi...@git.apache.org>.
Github user ygcao commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-174413771
  
    Thanks for the merge! It looks successful, but it seems the change is still not appearing in master.
    What's the rest of the workflow for it to appear in the master branch of Spark? Is any follow-up needed from me?




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47875988
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -281,17 +295,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    +          // do cutting only when the sentence is larger than maxSentenceLength
    +          if ((wordIter == null || !wordIter.hasNext) && sentenceIter.hasNext) {
    +            do {
    --- End diff --
    
    ok, fair enough




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52141782
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,26 +301,24 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    -      new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    -
    -        def next(): Array[Int] = {
    -          val sentence = ArrayBuilder.make[Int]
    -          var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    -              case Some(w) =>
    -                sentence += w
    -                sentenceLength += 1
    -              case None =>
    +    // each partition is a collection of sentences,
    +    // will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions {
    +      // Each sentence will map to 0 or more Array[Int]
    +      sentenceIter =>
    --- End diff --
    
    Generally I prefer the style:
    
    ```
    dataset.mapPartitions { sentenceIter =>
      sentenceIter.flatMap { sentence =>
        ...
      }
    }
    ```
    
    I don't think it's a hard rule for Spark but most other code follows this convention.
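    
    Applied to this diff, the convention would read roughly like the following sketch (the names `dataset`, `bcVocabHash`, and `maxSentenceLength` are taken from the diff above; this is a sketch, not the exact merged code):
    
    ```scala
    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
      // Each sentence maps to 0 or more Array[Int]
      sentenceIter.flatMap { sentence =>
        // Drop out-of-vocabulary words, then cut into maxSentenceLength chunks.
        val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
        wordIndexes.grouped(maxSentenceLength).map(_.toArray)
      }
    }
    ```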




[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r52638919
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -76,6 +76,18 @@ class Word2Vec extends Serializable with Logging {
       private var numIterations = 1
       private var seed = Utils.random.nextLong()
       private var minCount = 5
    +  private var maxSentenceLength = 1000
    +
    +  /**
    +   * Sets the maximum length of each sentence in the input data.
    +   * Any sentence longer than this threshold will be divided into chunks of
    +   * up to `maxSentenceLength` size (default: 1000)
    +   */
    +  @Since("2.0.0")
    +  def setMaxSentenceLength(maxSentenceLength: Int): this.type = {
    --- End diff --
    
    Sounds good.
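    
    For reference, a quick usage sketch of the setter under discussion; the surrounding configuration values and the `sentences` RDD (an `RDD[Seq[String]]`) are assumptions for illustration:
    
    ```scala
    import org.apache.spark.mllib.feature.Word2Vec
    
    // Hypothetical configuration; `sentences` is assumed to be built elsewhere.
    // Sentences longer than maxSentenceLength are divided into chunks of up
    // to that size before training.
    val word2vec = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(5)
      .setMaxSentenceLength(1000)
    val model = word2vec.fit(sentences)
    ```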

