You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ganonp <gi...@git.apache.org> on 2014/12/14 23:59:51 UTC

[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

GitHub user ganonp opened a pull request:

    https://github.com/apache/spark/pull/3693

    Added setMinCount to Word2Vec.scala

    Wanted to customize the private minCount variable in the Word2Vec class. Added
    a method to do so.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ganonp/spark my-custom-spark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3693.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3693
    
----
commit 5eb91000cd74ddd7704c79ca69259ee48c5840f9
Author: ganonp <ga...@gmail.com>
Date:   2014-12-14T21:56:19Z

    Added setMinCount to Word2Vec.scala
    
    Wanted to customize the minCount variable in the Word2Vec class. Added
    a method to do so.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68004178
  
    Bumping this, since it looks like it might be good to go.  Looks like there's a new change here related to making `norm` public, which might want additional review (I don't know anything about MLlib; just trying to help identify stuff that looks good-to-go to help drain the PR queue before the holidays).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68285349
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68279848
  
    @ganonp Could you update the branch and remove the last commit?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3693#discussion_r21866645
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -122,8 +122,13 @@ class Word2Vec extends Serializable with Logging {
       /** context words from [-window, window] */
       private val window = 5
     
    -  /** minimum frequency to consider a vocabulary word */
    -  private val minCount = 5
    +/** minimum frequency to consider a vocabulary word */
    --- End diff --
    
    2-space indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68315410
  
    LGTM (including the change to `norm`). Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by ganonp <gi...@git.apache.org>.
Github user ganonp commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68062042
  
    Sorry I didn't mean to commit that norm method for this pull request. That said, I think it makes sense for norm to be public or at least a d=2 version of norm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-67533957
  
    Sounds good!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3693#discussion_r21999842
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -121,9 +121,19 @@ class Word2Vec extends Serializable with Logging {
     
       /** context words from [-window, window] */
       private val window = 5
    -
    -  /** minimum frequency to consider a vocabulary word */
    -  private val minCount = 5
    +  
    +  /** The minimum number of times a token must occur in the training corpus to be 
    --- End diff --
    
    This should be formatted as:
    ```
      /**
       * The minimum number of times a token must occur in the training corpus to be
       * included in the word2vec model (default: 5).
       */
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68285678
  
      [Test build #24865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24865/consoleFull) for   PR 3693 at commit [`ad534f2`](https://github.com/apache/spark/commit/ad534f26c44a7bdc8ee91f73d80a93bd13aa6805).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68298891
  
      [Test build #24865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24865/consoleFull) for   PR 3693 at commit [`ad534f2`](https://github.com/apache/spark/commit/ad534f26c44a7bdc8ee91f73d80a93bd13aa6805).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-66934430
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by ganonp <gi...@git.apache.org>.
Github user ganonp commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-67502378
  
    O wow, I just didn't see that the function and everything inside was lining up... Hurts to look at. 
    
    Thanks for those links and your patience. Spark now makes up about 70% of my job, so I'll definitely be contributing more.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-67426600
  
    Oh, and yes, I meant to use 4 spaces inside the function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3693#discussion_r21999847
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -121,9 +121,19 @@ class Word2Vec extends Serializable with Logging {
     
       /** context words from [-window, window] */
       private val window = 5
    -
    -  /** minimum frequency to consider a vocabulary word */
    -  private val minCount = 5
    +  
    +  /** The minimum number of times a token must occur in the training corpus to be 
    +    * included in the word2vec model (default: 5). 
    +    */
    +  private var minCount = 5
    +
    +  /** Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's 
    +    * vocabulary (default: 5).
    +    */
    +  def setMinCount(minCount: Int): this.type = {
    +  this.minCount = minCount
    --- End diff --
    
    4 space indent


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-68298901
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24865/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3693#discussion_r21866649
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -122,8 +122,13 @@ class Word2Vec extends Serializable with Logging {
       /** context words from [-window, window] */
       private val window = 5
     
    -  /** minimum frequency to consider a vocabulary word */
    -  private val minCount = 5
    +/** minimum frequency to consider a vocabulary word */
    +private var minCount = 5
    +
    +def setMinCount(minCount: Int): this.type = {
    --- End diff --
    
    please add javadoc


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-67425285
  
    @ganonp  No problem (but we are pretty strict about style).  When in doubt, check out other code in the project for example.  Also, I'd recommend checking out these resources:
    * [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
    * [Spark style guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)
    * [Scala style guide](http://docs.scala-lang.org/style/)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3693#discussion_r22061443
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -121,9 +121,21 @@ class Word2Vec extends Serializable with Logging {
     
       /** context words from [-window, window] */
       private val window = 5
    +  
    +  /** 
    +   * The minimum number of times a token must occur in the training corpus to be 
    +   * included in the word2vec model (default: 5). 
    +   */
    +  private var minCount = 5
    --- End diff --
    
    Sorry, I just noticed that the other options are grouped at the top of the Word2Vec class.  Would you mind moving this there?  Thanks a lot!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by ganonp <gi...@git.apache.org>.
Github user ganonp commented on the pull request:

    https://github.com/apache/spark/pull/3693#issuecomment-67424689
  
    Sorry (first time contributing), do you mean I should use 4 spaces instead of tab per the convention? When I do this, my additions appear out of alignment with the rest of the code... 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3693#discussion_r21999844
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -121,9 +121,19 @@ class Word2Vec extends Serializable with Logging {
     
       /** context words from [-window, window] */
       private val window = 5
    -
    -  /** minimum frequency to consider a vocabulary word */
    -  private val minCount = 5
    +  
    +  /** The minimum number of times a token must occur in the training corpus to be 
    +    * included in the word2vec model (default: 5). 
    +    */
    +  private var minCount = 5
    +
    +  /** Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's 
    --- End diff --
    
    line should be < 100 chars.
    Also, format as in the comment above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Added setMinCount to Word2Vec.scala

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3693


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org