Posted to issues@spark.apache.org by "Manoj Kumar (JIRA)" <ji...@apache.org> on 2015/04/09 20:55:14 UTC

[jira] [Commented] (SPARK-6065) Optimize word2vec.findSynonyms speed

    [ https://issues.apache.org/jira/browse/SPARK-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487974#comment-14487974 ] 

Manoj Kumar commented on SPARK-6065:
------------------------------------

Sorry for taking so long to get back to this. I'm not sure that data structures like KD-trees or ball trees would help, since tree-based indexes tend to perform poorly on high-dimensional data (scikit-learn falls back to brute-force search when the metric is cosine). We could try an algorithm like Locality-Sensitive Hashing, but that might be overkill. WDYT?
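For context, the brute-force cosine search that findSynonyms effectively performs can be sketched in NumPy like this (a minimal illustration with hypothetical names, not Spark's actual implementation):

```python
import numpy as np

def find_synonyms(query, vocab, vectors, k=5):
    """Brute-force cosine-similarity search over the whole vocabulary.

    vocab   : list of words
    vectors : (n_words, n_features) array of word vectors
    query   : (n_features,) vector for the word of interest
    """
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    q = query / np.linalg.norm(query)
    sims = unit @ q                      # one dot product per vocabulary word
    top = np.argsort(-sims)[:k]          # every word in the vocabulary is scored
    return [(vocab[i], float(sims[i])) for i in top]
```

This is O(n_words * n_features) per query, which is why the lookup is slow for large vocabularies.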

> Optimize word2vec.findSynonyms speed
> ------------------------------------
>
>                 Key: SPARK-6065
>                 URL: https://issues.apache.org/jira/browse/SPARK-6065
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> word2vec.findSynonyms iterates through the entire vocabulary to find similar words.  This is really slow relative to the [gcode-hosted word2vec implementation | https://code.google.com/p/word2vec/].  It should be optimized by storing words in a data structure designed for finding nearest neighbors.
> This would require storing a copy of the model (basically an inverted dictionary), which could be a problem if users have a big model (e.g., 100 features x 10M words or phrases = big dictionary).  It might be best to provide a function for converting the model into a model optimized for findSynonyms.
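If the LSH route mentioned above were pursued, the standard random-hyperplane scheme for cosine similarity could be sketched as follows (an illustration under assumed names, not a proposed Spark API):

```python
import numpy as np

class CosineLSH:
    """Random-hyperplane LSH: vectors pointing in similar directions tend
    to land in the same bucket, so a query scans one bucket instead of
    the full vocabulary."""

    def __init__(self, vectors, vocab, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        n_features = vectors.shape[1]
        # Each random hyperplane contributes one bit of the hash.
        self.planes = rng.standard_normal((n_bits, n_features))
        self.vocab = vocab
        self.vectors = vectors
        self.buckets = {}
        for i, v in enumerate(vectors):
            self.buckets.setdefault(self._hash(v), []).append(i)

    def _hash(self, v):
        # Sign of the projection onto each hyperplane -> bit pattern.
        return tuple(bool(b) for b in (self.planes @ v) >= 0)

    def query(self, q, k=5):
        # Score only the candidates in the query's bucket.
        candidates = self.buckets.get(self._hash(q), [])
        qn = q / np.linalg.norm(q)
        scored = [(self.vocab[i],
                   float(self.vectors[i] @ qn / np.linalg.norm(self.vectors[i])))
                  for i in candidates]
        scored.sort(key=lambda t: -t[1])
        return scored[:k]
```

The trade-off matches the concern in the description: the index is an extra copy of the model (buckets plus hyperplanes), and recall is approximate, so it fits better as a separate converted model than as a change to the default lookup.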



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org