Posted to reviews@spark.apache.org by ygcao <gi...@git.apache.org> on 2016/02/01 09:44:00 UTC

[GitHub] spark pull request: [SPARK-12153][SPARK-7617][MLlib]add support of...

Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r51391185
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -289,17 +301,28 @@ class Word2Vec extends Serializable with Logging {
         val expTable = sc.broadcast(createExpTable())
         val bcVocab = sc.broadcast(vocab)
         val bcVocabHash = sc.broadcast(vocabHash)
    -
    -    val sentences: RDD[Array[Int]] = words.mapPartitions { iter =>
    +    // each partition is a collection of sentences, will be translated into arrays of Index integer
    +    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
           new Iterator[Array[Int]] {
    -        def hasNext: Boolean = iter.hasNext
    +        var wordIter: Iterator[String] = null
    +
    +        def hasNext: Boolean = sentenceIter.hasNext || (wordIter != null && wordIter.hasNext)
     
             def next(): Array[Int] = {
               val sentence = ArrayBuilder.make[Int]
               var sentenceLength = 0
    -          while (iter.hasNext && sentenceLength < MAX_SENTENCE_LENGTH) {
    -            val word = bcVocabHash.value.get(iter.next())
    -            word match {
    +          // do translation of each word into its index in the vocabulary,
    --- End diff --
    
    I finally made up my mind to do a hacky, simple perf test just as a proof of concept: across the 5x runs, the performance difference between the implementations is negligible, since it falls within the run-to-run variance of a single version.
    Some details:
    I prepared a 32k document from two arbitrarily picked Wikipedia pages ("Machine learning" and "Adversarial machine learning", excluding the references sections). It contains 341 lines and can be split into 442 sentences simply by using dot+space as the sentence boundary. I injected the following test case into the Word2VecSuite class and ran it against three different implementations of the fit function in the mllib.feature.Word2Vec class: the old one in the master branch, and my final two versions from before and after adopting Sean's suggestion.
    
    <pre>
    test("testSpeed") {
        val lines = sc.parallelize(Source.fromFile(new File("/home/ygcao/machinelearning.txt")).getLines().toSeq)
        val sentences = lines.flatMap(_.split("\\. ")).map(line => line.split(" ").toSeq)
        println("read file into rdd, lines=", sentences.count())
        var builtModel: org.apache.spark.mllib.feature.Word2VecModel = null
        var duration = 0l
        for (i <- 1 to 5) {
          val start = System.currentTimeMillis()
          val model = new org.apache.spark.mllib.feature.Word2Vec().setVectorSize(3).setSeed(42l)
          builtModel = model.fit(sentences)
          duration += (System.currentTimeMillis() - start)
        }
        println(s"builtModel take ${duration},vocabulary size:${builtModel.getVectors.size}, learning's synonyms:${builtModel.findSynonyms("learning", 4).mkString("\n")}")
      }
    </pre>
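    
    (A side note on the splitting above, using made-up sample text of my own: the "\\. " split is what turns the 341 input lines into 442 sentences, since a single line can hold several sentences.)
    <pre>
    // illustration only, not part of the test: one input line yields two sentences
    val line = "Machine learning explores algorithms. These algorithms build a model."
    val sentences = line.split("\\. ")
    // sentences = Array("Machine learning explores algorithms", "These algorithms build a model.")
    </pre>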
    
     The vocabulary size from the model is 155. Here are the timings of three runs of each version, plus the average of the final two runs. As you can see from the code, each run actually builds the model 5 times to magnify any potential difference.
    <pre>
    (times in ms)        masterVersion   PR-useIter   PR-useCollection
    run1                 2232            2107         1933
    run2                 2085            1986         1987
    run3                 2005            2123         2004
    average(run2, run3)  2045            2054.5       1995.5
    </pre>
    BTW, the following is not relevant to the perf test, just FYI: the two versions in this pull request produce exactly the same result, which supports the correctness of both. The result is interesting as well, even though the dataset is quite tiny: the new versions (the un-merged ones) look better than the old version (the one currently in the master branch). Of course, you can make the new version just as bad by using a poor sentence splitter (a hard-cut splitter would do exactly the same thing as the old version; see the sketch below). The simple splitter used for the test case can't deal with abbreviations, which is why I removed the references sections from the text.
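    For illustration, here's a minimal sketch of what such a hard-cut splitter could look like (the object and method names are mine, and the chunk size stands in for the MAX_SENTENCE_LENGTH constant referenced in the diff above; it's not part of this PR):
    <pre>
    // Illustration only: a "hard-cut" splitter that ignores real sentence
    // boundaries and simply chops the word stream into fixed-size chunks,
    // which is effectively what the old implementation did internally.
    object HardCutSplitter {
      def split(words: Iterator[String], maxSentenceLength: Int): Iterator[Seq[String]] =
        words.grouped(maxSentenceLength)

      def main(args: Array[String]): Unit = {
        val words = "this stream has no sentence boundaries at all".split(" ").iterator
        // prints two chunks of 4 words each
        split(words, 4).foreach(println)
      }
    }
    </pre>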
    Here are the top synonyms of "learning" using the tiny dataset. Please keep in mind it's just for fun, not solid proof of which version is definitely better, since the dataset is tiny.
    New versions, learning's synonyms:
    (network,0.9990321742854605)
    (related,0.9966140511173031)
    (sparse,0.9965729586431097)
    (algorithms,0.99376379497485)
    
    Old version, learning's synonyms:
    (against,0.9895162633562077)
    (Support,0.9547255372896342)
    (Association,0.9499811242788365)
    (Attacks,0.9321700815006693)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org