Posted to commits@spark.apache.org by sr...@apache.org on 2015/12/05 16:27:38 UTC
spark git commit: [SPARK-12096][MLLIB] remove the old constraint in word2vec
Repository: spark
Updated Branches:
refs/heads/master 3af53e61f -> ee94b70ce
[SPARK-12096][MLLIB] remove the old constraint in word2vec
jira: https://issues.apache.org/jira/browse/SPARK-12096
word2vec can now handle a much bigger vocabulary.
The old constraint, vocabSize.toLong * vectorSize < Int.MaxValue / 8, should be removed.
The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue).
I tested with a vocabSize over 18M and vectorSize = 100.
srowen jkbradley Sorry for missing this in the last PR. I was reminded today.
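To make the arithmetic behind the two constraints concrete, here is a minimal standalone sketch in Scala. It is not code from Word2Vec.scala: the object and method names are hypothetical, and 18000000 only approximates the "over 18M" vocabulary mentioned above. It compares the old and new checks and shows why the product is widened with .toLong before the comparison. Note that the check compares against Int.MaxValue itself, whereas the real JVM array limit is usually slightly lower.

```scala
// Hypothetical helper object for illustration; not part of Spark.
object Word2VecCapacityCheck {
  // Old rule: vocabSize * vectorSize * 8 had to stay below Int.MaxValue.
  def passesOldCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize * 8 < Int.MaxValue

  // New rule: only the product itself has to stay below Int.MaxValue.
  def passesNewCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize < Int.MaxValue

  // Plain Int multiplication, which silently wraps around on overflow.
  def intProduct(vocabSize: Int, vectorSize: Int): Int =
    vocabSize * vectorSize

  def main(args: Array[String]): Unit = {
    // Assumed sizes, roughly matching the test described in the message.
    val vocabSize = 18000000
    val vectorSize = 100

    println(s"old check passes: ${passesOldCheck(vocabSize, vectorSize)}")
    println(s"new check passes: ${passesNewCheck(vocabSize, vectorSize)}")

    // With a larger vocabulary the Int product wraps to a negative number,
    // so a check without .toLong would wrongly accept it.
    println(s"Int product for 30M x 100:  ${intProduct(30000000, 100)}")
    println(s"Long product for 30M x 100: ${30000000.toLong * 100}")
    println(s"new check passes for 30M x 100: ${passesNewCheck(30000000, 100)}")
  }
}
```

Under these assumed sizes, the 18M x 100 case is rejected by the old rule and accepted by the new one, while the 30M x 100 case is rejected by both once the product is computed as a Long.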
Author: Yuhao Yang <hh...@gmail.com>
Closes #10103 from hhbyyh/w2vCapacity.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ee94b70c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ee94b70c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ee94b70c
Branch: refs/heads/master
Commit: ee94b70ce56661ea26c5aad17778ade32f3f1d3d
Parents: 3af53e6
Author: Yuhao Yang <hh...@gmail.com>
Authored: Sat Dec 5 15:27:31 2015 +0000
Committer: Sean Owen <so...@cloudera.com>
Committed: Sat Dec 5 15:27:31 2015 +0000
----------------------------------------------------------------------
.../src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/ee94b70c/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 655ac0b..be12d45 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -306,10 +306,10 @@ class Word2Vec extends Serializable with Logging {
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)
 
-    if (vocabSize.toLong * vectorSize * 8 >= Int.MaxValue) {
+    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
       throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
         " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
-        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue/8`.")
+        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue`.")
     }
 
     val syn0Global =