Posted to commits@spark.apache.org by me...@apache.org on 2015/01/30 19:07:29 UTC

spark git commit: [MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount

Repository: spark
Updated Branches:
  refs/heads/master 254eaa4d3 -> 54d95758f


[MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount

When vocabSize*vectorSize is larger than Int.MaxValue/8, throw a RuntimeException, because under this condition allocating the memory to serialize the arrays syn0Global and syn1Global would definitely cause an OOM. syn0Global and syn1Global are float arrays; serializing them should need a byte array of more than 8 times syn0Global's size.
Also, if we catch an OOM even though vocabSize*vectorSize is less than Int.MaxValue/8, we should give users hints to increase minCount or decrease vectorSize.
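
To make the bound concrete, here is a minimal sketch (not part of the commit) that computes the largest vocabulary size that still passes the check for a given vectorSize. The names Word2VecSizeBound, maxProduct and maxVocabSize are hypothetical and exist only for this illustration; the inequality itself is taken from the patch.

```scala
object Word2VecSizeBound {
  // The patch throws when vocabSize.toLong * vectorSize * 8 >= Int.MaxValue,
  // so the largest product vocabSize * vectorSize that passes is Int.MaxValue / 8
  // (integer division).
  val maxProduct: Int = Int.MaxValue / 8

  // Hypothetical helper: largest vocabSize that passes the check for a given vectorSize.
  def maxVocabSize(vectorSize: Int): Int = {
    require(vectorSize > 0, "vectorSize must be positive")
    maxProduct / vectorSize
  }
}
```

For example, Word2VecSizeBound.maxVocabSize(100) evaluates to 2684354, so with 100-dimensional vectors a vocabulary of roughly 2.7 million words is the most this check accepts. Raising minCount shrinks the vocabulary, which is why the error message suggests it.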

Author: Joseph J.C. Tang <ji...@gmail.com>

Closes #4247 from jinntrance/w2v-fix and squashes the following commits:

b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/54d95758
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/54d95758
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/54d95758

Branch: refs/heads/master
Commit: 54d95758fcbe29a9af0f59673ac0b8a8c72b778e
Parents: 254eaa4
Author: Joseph J.C. Tang <ji...@gmail.com>
Authored: Fri Jan 30 10:07:26 2015 -0800
Committer: Xiangrui Meng <me...@databricks.com>
Committed: Fri Jan 30 10:07:26 2015 -0800

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/mllib/feature/Word2Vec.scala  | 7 +++++++
 1 file changed, 7 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/54d95758/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index d25a7cd..a3e4020 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -290,6 +290,13 @@ class Word2Vec extends Serializable with Logging {
     
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)
+
+    if (vocabSize.toLong * vectorSize * 8 >= Int.MaxValue) {
+      throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
+        " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
+        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue/8`.")
+    }
+
     val syn0Global =
       Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
     val syn1Global = new Array[Float](vocabSize * vectorSize)
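
A note on the `.toLong` in the added check: without it the product would be computed in 32-bit Int arithmetic and could wrap around before the comparison. The sketch below is an illustration only. SizeGuardSketch, exceedsLimit and exceedsLimitIntOnly are hypothetical names, and the Int-only variant is a deliberately broken counterexample, not code from Spark.

```scala
object SizeGuardSketch {
  // Same comparison as the patch: promote to Long before multiplying,
  // so the product cannot wrap around.
  def exceedsLimit(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize * 8 >= Int.MaxValue

  // Hypothetical variant WITHOUT the promotion, shown only to illustrate the pitfall.
  // The Int product wraps on overflow, and a wrapped multiple of 8 can never equal
  // the odd value Int.MaxValue, so this version never reports a problem.
  def exceedsLimitIntOnly(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize * vectorSize * 8 >= Int.MaxValue
}
```

With vocabSize = 3000000 and vectorSize = 100, the true product times 8 is 2,400,000,000, which is above Int.MaxValue. exceedsLimit returns true, while exceedsLimitIntOnly sees a wrapped negative value and returns false.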

