Posted to reviews@spark.apache.org by ygcao <gi...@git.apache.org> on 2016/02/01 09:44:00 UTC
[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...
Github user ygcao commented on a diff in the pull request:
https://github.com/apache/spark/pull/10152#discussion_r51391185
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
val expTable = sc.broadcast(createExpTable())
val bcVocab = sc.broadcast(vocab)
val bcVocabHash = sc.broadcast(vocabHash)
-
- val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
+ // each partition is a collection of sentences, will be translated into arrays of Index integer
+ val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
new Iterator[Array[Int]] {
- def hasNext: Boolean = iter.hasNext
+ var wordIter: Iterator[String] = null
+
+ def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
def next(): Array[Int] = {
val sentence = ArrayBuilder.make[Int]
var sentenceLength = 0
- while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
- val word = bcVocabHash.value.get(iter.next())
- word match {
+ // do translation of each word into its index in the vocabulary,
--- End diff --
I finally decided to run a quick, hacky perf test as a proof of concept. The performance difference between the implementations is negligible: it falls within the run-to-run variance of a single version.
Some details:
I prepared a 32k document from two arbitrarily picked Wikipedia pages ("Machine learning" and "Adversarial machine learning", without their reference sections). It contains 341 lines and can be split into 442 sentences by simply using dot+space as the sentence boundary. I injected the following test case into the Word2VecSuite class and ran it against three different implementations of the fit function in the mllib.feature.Word2Vec class: the old one in the master branch, and my final two versions, one from before adopting Sean's suggestion and one from after adopting it.
<pre>
test("testSpeed") {
  // Requires: import java.io.File and import scala.io.Source
  val lines = sc.parallelize(Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
  // Split on dot+space for sentence boundaries, then on spaces for words.
  val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
  println(s"read file into rdd, sentences=${sentences.count()}")
  var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
  var duration = 0L
  // Fit the model 5 times and accumulate the elapsed time to magnify any difference.
  for (i <- 1 to 5) {
    val start = System.currentTimeMillis()
    val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42L)
    builtModel = model.fit(sentences)
    duration += (System.currentTimeMillis() - start)
  }
  val synonyms = builtModel.findSynonyms("learning", 4).mkString("\n")
  println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, learning's synonyms:${synonyms}")
}
</pre>
The vocabulary size of the model is 155. Below are the times (in milliseconds, as measured with System.currentTimeMillis) for three runs of each version, along with the average of the last two runs. As the code shows, each run builds the model 5 times to magnify any potential difference.
<pre>
                     masterVersion  PR-useIter  PR-useCollection
run1                 2232           2107        1933
run2                 2085           1986        1987
run3                 2005           2123        2004
average(run2, run3)  2045           2054.5      1995.5
</pre>
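One way to make the "within variance" reading concrete is to compare each version's run-to-run spread with the gap between the versions' averages. The sketch below does that in plain Scala, using only the numbers from the table above. The object and helper names (`PerfSpread`, `spread`, `avgLastTwo`, `maxGap`) are made up for illustration and are not part of Spark.

```scala
object PerfSpread {
  // Timings in ms, copied from the table above (each value is the total of 5 fits).
  val timings: Map[String, Seq[Double]] = Map(
    "masterVersion"    -> Seq(2232.0, 2085.0, 2005.0),
    "PR-useIter"       -> Seq(2107.0, 1986.0, 2123.0),
    "PR-useCollection" -> Seq(1933.0, 1987.0, 2004.0)
  )

  // Run-to-run spread of one version: slowest run minus fastest run.
  def spread(runs: Seq[Double]): Double = runs.max - runs.min

  // Average of the last two runs, matching the last row of the table.
  def avgLastTwo(runs: Seq[Double]): Double = runs.takeRight(2).sum / 2.0

  // Largest gap between any two versions' averages.
  def maxGap: Double = {
    val avgs = timings.values.map(avgLastTwo)
    avgs.max - avgs.min
  }

  def main(args: Array[String]): Unit = {
    timings.foreach { case (name, runs) =>
      println(s"$name: spread=${spread(runs)} ms, avgLastTwo=${avgLastTwo(runs)} ms")
    }
    println(s"largest gap between averages: $maxGap ms")
  }
}
```

With these numbers, the largest gap between averages (59 ms) is smaller than the smallest per-version spread (71 ms), which is consistent with treating the differences as noise. Three runs is a small sample, so this is only a sanity check, not a statistical test.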
BTW, the following is not relevant to the perf test, just FYI. The two versions in this pull request produce exactly the same result, which supports the correctness of both. The result is interesting as well: although the dataset is quite tiny, the new (unmerged) versions look better than the old version (the one currently in the master branch). Of course, you can mess up the new version with a bad sentence splitter (and a hard-cut splitter would do exactly the same thing as the old version). The simple splitter used in the test case can't deal with abbreviations, which is why I removed the references sections from the text.
Here are the top synonyms of "learning" from the tiny dataset. Please keep in mind that this is just for fun, not solid proof of which version is better, since the dataset is tiny.
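To illustrate the abbreviation problem, here is a minimal sketch of the same dot+space rule applied to two made-up strings. The object name `NaiveSplitter` and the sample sentences are assumptions for illustration only; they do not come from the test document.

```scala
object NaiveSplitter {
  // Same rule as the test case: a sentence ends at a dot followed by a space.
  def sentences(text: String): Seq[String] = text.split("\\. ").toSeq

  def main(args: Array[String]): Unit = {
    // Two real sentences, split correctly into two pieces.
    val plain = "Learning is fun. Models need data."
    // Also two real sentences, but "e.g. " triggers an extra, wrong split.
    val abbreviated = "Some models, e.g. SVMs, need labels. Others do not."

    sentences(plain).foreach(println)
    println("--")
    sentences(abbreviated).foreach(println)
  }
}
```

The second string comes back as three pieces instead of two, because the abbreviation "e.g." is followed by a space. Reference sections tend to be full of such abbreviations, so a splitter this simple would cut them into many spurious short sentences.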
New versions: learning's synonyms:
(network,0.9990321742854605)
(related,0.9966140511173031)
(sparse,0.9965729586431097)
(algorithms,0.99376379497485)
Old version, learning's synonyms:
(against,0.9895162633562077)
(Support,0.9547255372896342)
(Association,0.9499811242788365)
(Attacks,0.9321700815006693)