Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2016/06/17 11:57:08 UTC

[GitHub] spark pull request #13735: [SPARK-15328][MLLIB][ML] Word2Vec import for orig...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/13735

    [SPARK-15328][MLLIB][ML] Word2Vec import for original binary format

    ## What changes were proposed in this pull request?
    
    Add a `loadGoogleModel()` function to import the original word2vec binary format.
    
    
    ## How was this patch tested?
    
    `mllib.feature.Word2VecSuite` and `ml.feature.Word2VecSuite`
    
    I also tested with a real model:
    ![spark_load_google_word2vec](https://cloud.githubusercontent.com/assets/5399861/15271931/2a8d4f4a-1a93-11e6-880b-27122f608909.png)
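
For readers unfamiliar with the "original binary format" mentioned above: as far as I understand the output of the C word2vec tool, the file starts with an ASCII header line `<vocabSize> <vectorSize>\n`, followed by one record per word consisting of the word, a single space, `vectorSize` raw 32-bit floats, and a trailing newline. The following is a minimal, Spark-free sketch of a reader under those assumptions. It is not the code in this pull request; the names `Word2VecBinarySketch`, `read`, and `encode` are hypothetical, and the floats are assumed to be little-endian (the C tool writes them in native byte order).

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper object; not part of Spark or of this pull request.
object Word2VecBinarySketch {

  // Reads bytes up to (not including) the delimiter. Newline bytes are dropped,
  // which discards the '\n' that trails the previous record's vector.
  // Throws java.io.EOFException if the stream ends before the delimiter.
  private def readToken(in: DataInputStream, delimiter: Char): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.readByte()
    while (b != delimiter.toByte) {
      if (b != '\n'.toByte) buf.write(b.toInt)
      b = in.readByte()
    }
    new String(buf.toByteArray, StandardCharsets.UTF_8)
  }

  // Parses the assumed layout: header, then (word, space, floats, newline) records.
  def read(input: InputStream): Map[String, Array[Float]] = {
    val in = new DataInputStream(new BufferedInputStream(input))
    val vocabSize = readToken(in, ' ').toInt
    val vectorSize = readToken(in, '\n').toInt
    val raw = new Array[Byte](vectorSize * 4)
    val result = Map.newBuilder[String, Array[Float]]
    for (_ <- 0 until vocabSize) {
      val word = readToken(in, ' ')
      in.readFully(raw) // fixed-width read, so 0x20 or 0x0A bytes inside floats are harmless
      val floats = new Array[Float](vectorSize)
      // Assumption: little-endian floats; DataInputStream.readFloat would read big-endian.
      ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(floats)
      result += (word -> floats)
    }
    result.result()
  }

  // Writes a tiny model in the same assumed layout, only so the reader can be exercised.
  def encode(model: Seq[(String, Array[Float])]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dim = model.head._2.length
    out.write(s"${model.size} $dim\n".getBytes(StandardCharsets.UTF_8))
    for ((word, vec) <- model) {
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      val bb = ByteBuffer.allocate(dim * 4).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(v => bb.putFloat(v))
      out.write(bb.array())
      out.write('\n'.toInt)
    }
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val bytes = encode(Seq(
      "spark" -> Array(0.1f, 0.2f, 0.3f),
      "hadoop" -> Array(-1.0f, 0.0f, 1.0f)))
    val model = read(new ByteArrayInputStream(bytes))
    model.foreach { case (word, vec) =>
      val rendered = vec.mkString(", ")
      println(s"$word -> $rendered")
    }
  }
}
```

The sketch holds the whole vocabulary in driver memory as a `Map`, which is fine for a toy file but would need a lot of heap for a multi-gigabyte model.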

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-15328

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13735.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13735
    
----
commit d4c77253a89c2b85c3b976db6bcd578a20d43b35
Author: Yuming Wang <wg...@gmail.com>
Date:   2016-06-17T11:52:27Z

    Load Google word2vec model

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13735: [SPARK-15328][MLLIB][ML] Word2Vec import for original bi...

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13735
  
    Can one of the admins verify this patch?



[GitHub] spark issue #13735: [SPARK-15328][MLLIB][ML] Word2Vec import for original bi...

Posted by insidedctm <gi...@git.apache.org>.

GitHub user insidedctm commented on the issue:

    https://github.com/apache/spark/pull/13735
  
    This seems to work fine with a small model, such as the one produced by demo_word.sh in the word2vec code repository. However, I run into problems with a large model such as [GoogleNews-vectors-negative300.bin](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).
    
    I can successfully load the model using this code (although I needed to give the driver 12 GB of memory):
    `import org.apache.spark.ml.feature.Word2VecModel`
    `val path = "file:///Downloads/GoogleNews-vectors-negative300.bin"`
    `val model = Word2VecModel.loadGoogleModel(path)`
    
    However, synonyms are not found for a typical lookup, e.g.
    `model.findSynonyms("spark", 20).show`
    responds with
    `java.lang.IllegalStateException: spark not in vocabulary`
    
    However, the distance tool from the word2vec toolkit, loading the same model, gives:
    
    <img width="594" alt="screen shot 2016-09-15 at 12 57 03" src="https://cloud.githubusercontent.com/assets/5909684/18549055/0a60f9da-7b44-11e6-895c-88ee018ed1a9.png">
    




[GitHub] spark pull request #13735: [SPARK-15328][MLLIB][ML] Word2Vec import for orig...

Posted by wangyum <gi...@git.apache.org>.

GitHub user wangyum closed the pull request at:

    https://github.com/apache/spark/pull/13735

