Posted to issues@spark.apache.org by "Giangiacomo Sanna (Jira)" <ji...@apache.org> on 2020/06/03 19:51:00 UTC

[jira] [Comment Edited] (SPARK-6617) Word2Vec is nondeterministic

    [ https://issues.apache.org/jira/browse/SPARK-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125138#comment-17125138 ] 

Giangiacomo Sanna edited comment on SPARK-6617 at 6/3/20, 7:50 PM:
-------------------------------------------------------------------

Sorry, I see that this has not been fixed yet. I'm not fluent in Scala, but I see at least two things causing non-determinism:

I see "repartition" at [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L347] (same line as in the original issue). Replacing it with repartitionAndSortWithinPartitions would help make the fit deterministic.

On top of that, when the vocabulary is learned in the learnVocab method ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L198]), the dataset of word-occurrence counts is collected and then sorted by count. This is also non-deterministic: there are many ties, and they are resolved according to the collect order. Since this sort determines the integer index assigned to each word, it also makes the repartition at line 347 non-deterministic.
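To illustrate the tie-breaking problem, here is a minimal sketch in plain Scala (no Spark; the object and value names are made up for the example). Sorting by count alone leaves equal-count words in whatever order they arrived from collect, while adding the word itself as a secondary sort key would make the vocabulary order deterministic:

```scala
object TieBreakDemo {
  // Two hypothetical collect orders of the same word counts ("cat" and "dog" tie at 2).
  val runA = Seq(("cat", 2), ("dog", 2), ("fish", 1))
  val runB = Seq(("dog", 2), ("cat", 2), ("fish", 1))

  // Sorting by count alone: the stable sort keeps the (nondeterministic)
  // collect order for tied counts.
  def byCountOnly(xs: Seq[(String, Int)]): Seq[String] =
    xs.sortBy(-_._2).map(_._1)

  // Adding the word as a secondary key resolves ties the same way every run.
  def byCountThenWord(xs: Seq[(String, Int)]): Seq[String] =
    xs.sortBy { case (w, c) => (-c, w) }.map(_._1)

  def main(args: Array[String]): Unit = {
    println(byCountOnly(runA) == byCountOnly(runB))         // false: vocabulary order differs
    println(byCountThenWord(runA) == byCountThenWord(runB)) // true: order is stable
  }
}
```

Since the word-to-index mapping feeds the later repartition, fixing the tie-break here would be a prerequisite for making the whole fit reproducible.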

Thanks!


was (Author: giangiacomosanna):
Sorry, I see this marked as resolved, but even when I fix the seed I get changing results for the fit method. Unfortunately, whenever full reproducibility is a regulatory constraint (as it is in some industries) this means that Spark's word2vec cannot be used.

I see "repartition" at [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L347] (same line as in the original issue). Replacing it with repartitionAndSortWithinPartitions would help make the fit deterministic.

On top of that, when the vocabulary is learned in the learnVocab method ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L198]), the dataset of word-occurrence counts is collected and then sorted by count. This is also non-deterministic: there are many ties, and they are resolved according to the collect order. Since this sort determines the integer index assigned to each word, it also makes the repartition at line 347 non-deterministic.

> Word2Vec is nondeterministic
> ----------------------------
>
>                 Key: SPARK-6617
>                 URL: https://issues.apache.org/jira/browse/SPARK-6617
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Xiangrui Meng
>            Priority: Minor
>              Labels: bulk-closed
>
> Word2Vec uses repartition: https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L291, which doesn't provide deterministic ordering. This makes QA a little harder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org