Posted to issues@spark.apache.org by "William Benton (JIRA)" <ji...@apache.org> on 2016/09/19 14:35:20 UTC

[jira] [Created] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms

William Benton created SPARK-17595:
--------------------------------------

             Summary: Inefficient selection in Word2VecModel.findSynonyms
                 Key: SPARK-17595
                 URL: https://issues.apache.org/jira/browse/SPARK-17595
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.0.0
            Reporter: William Benton
            Priority: Minor


`Word2VecModel.findSynonyms` currently chooses the vocabulary elements with the highest similarity to the query vector by sorting the similarities for every vocabulary element.  This makes multiple copies of the collection of similarities and performs a (relatively) expensive sort.  It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it in a single pass over the vocabulary.
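
To illustrate the proposed approach, here is a minimal sketch of top-k selection with a bounded priority queue.  It is not the Spark implementation or a proposed patch: `TopKSketch` and `topK` are hypothetical names, and the input is assumed to be an iterator of (word, similarity) pairs that have already been computed against the query vector.

```scala
import scala.collection.mutable

object TopKSketch {
  // Hypothetical helper, not part of Spark: returns the k highest-scoring
  // (word, similarity) pairs, ordered from most to least similar.
  def topK(scored: Iterator[(String, Double)], k: Int): Seq[(String, Double)] = {
    require(k >= 0, "k must be non-negative")

    // Reversing the ordering turns the queue into a min-heap on similarity,
    // so `head` is always the weakest candidate currently retained.
    val bySimilarity = Ordering.by[(String, Double), Double](_._2).reverse
    val heap = new mutable.PriorityQueue[(String, Double)]()(bySimilarity)

    // Single pass over the candidates; the heap never holds more than k items.
    scored.foreach { candidate =>
      if (heap.size < k) {
        heap.enqueue(candidate)
      } else if (k > 0 && candidate._2 > heap.head._2) {
        heap.dequeue()          // evict the weakest retained candidate
        heap.enqueue(candidate)
      }
    }

    // Only the k survivors are sorted, not the whole vocabulary.
    heap.toList.sortBy { case (_, similarity) => -similarity }
  }
}
```

For a vocabulary of n words, this keeps at most k candidates in memory and does O(n log k) comparison work, compared with O(n log n) for sorting every similarity.  A call such as `TopKSketch.topK(similarities.iterator, num)` would stand in for the sort-then-take step, where `similarities` and `num` are placeholder names for the scored vocabulary and the requested number of synonyms.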



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
