Posted to commits@spark.apache.org by sr...@apache.org on 2015/12/05 16:27:38 UTC
spark git commit: [SPARK-12096][MLLIB] remove the old constraint in word2vec

Repository: spark
Updated Branches:
  refs/heads/master 3af53e61f -> ee94b70ce


[SPARK-12096][MLLIB] remove the old constraint in word2vec

jira: https://issues.apache.org/jira/browse/SPARK-12096

word2vec can now handle a much bigger vocabulary.
The old constraint, vocabSize.toLong * vectorSize < Int.MaxValue / 8, should be removed.

The new constraint is vocabSize.toLong * vectorSize < the maximum array length (usually a little less than Int.MaxValue).

I tested with a vocabSize of over 18M and vectorSize = 100.
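
As a rough illustration of the arithmetic, the sketch below compares the check before and after this change. The object and method names are hypothetical and not part of Word2Vec.scala; only the two comparison expressions are taken from the diff. The vocabulary size of exactly 18,000,000 is an assumed stand-in for the "over 18M" test. With vectorSize = 100, the old check rejects any vocabulary above roughly 2.7 million words, while the new one allows roughly 21 million.

```scala
object Word2VecCapacity {
  // Check as it stood before this commit: true means the configuration is rejected.
  def rejectedByOldCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize * 8 >= Int.MaxValue

  // Check introduced by this commit: true means the configuration is rejected.
  def rejectedByNewCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize >= Int.MaxValue

  def main(args: Array[String]): Unit = {
    val vocabSize = 18000000 // assumed value standing in for "over 18M"
    val vectorSize = 100

    // Total number of vector elements, computed as a Long.
    println(vocabSize.toLong * vectorSize)

    println(rejectedByOldCheck(vocabSize, vectorSize)) // rejected under the old constraint
    println(rejectedByNewCheck(vocabSize, vectorSize)) // accepted under the new constraint

    // Why .toLong matters: the same product in Int arithmetic can wrap
    // around to a negative number and slip past a ">=" comparison.
    val largeVocab = 30000000
    println(largeVocab * vectorSize)        // Int multiplication, overflows
    println(largeVocab.toLong * vectorSize) // Long multiplication, exact
  }
}
```

Widening to Long before multiplying keeps the comparison meaningful for large vocabularies, which is presumably why both versions of the check call toLong first.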

srowen jkbradley Sorry for missing this in the last PR. I was reminded of it today.

Author: Yuhao Yang <hh...@gmail.com>

Closes #10103 from hhbyyh/w2vCapacity.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ee94b70c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ee94b70c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ee94b70c

Branch: refs/heads/master
Commit: ee94b70ce56661ea26c5aad17778ade32f3f1d3d
Parents: 3af53e6
Author: Yuhao Yang <hh...@gmail.com>
Authored: Sat Dec 5 15:27:31 2015 +0000
Committer: Sean Owen <so...@cloudera.com>
Committed: Sat Dec 5 15:27:31 2015 +0000

----------------------------------------------------------------------
 .../src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/ee94b70c/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 655ac0b..be12d45 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -306,10 +306,10 @@ class Word2Vec extends Serializable with Logging {
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)
 
-    if (vocabSize.toLong * vectorSize * 8 >= Int.MaxValue) {
+    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
       throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
         " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
-        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue/8`.")
+        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue`.")
     }
 
     val syn0Global =

