Posted to reviews@spark.apache.org by ymazari <gi...@git.apache.org> on 2018/01/23 18:00:32 UTC
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
GitHub user ymazari opened a pull request:
https://github.com/apache/spark/pull/20366
[SPARK-23166] [ML] Add maxDF Parameter to CountVectorizer
## What changes were proposed in this pull request?
This PR adds a `maxDF` parameter to `CountVectorizer`, complementing the existing `minDF` parameter. Terms that appear in more than `maxDF` documents are excluded from the vocabulary, which is useful for dropping overly common, uninformative terms.
## How was this patch tested?
Added unit tests to `CountVectorizerSuite` covering `maxDF` specified both as an absolute document count and as a fraction of the number of documents.
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ymazari/spark SPARK-23166
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20366.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20366
----
commit 568ea65ff32dce179097098b5d2934df20cac17c
Author: Yacine Mazari <y....@...>
Date: 2018-01-22T13:22:58Z
[SPARK-23166] [ML] Add maxDF Parameter to CountVectorizer
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
Posted by ymazari <gi...@git.apache.org>.
Github user ymazari commented on a diff in the pull request:
https://github.com/apache/spark/pull/20366#discussion_r163355088
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -169,7 +197,7 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
}.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
(wc1 + wc2, df1 + df2)
}.filter { case (word, (wc, df)) =>
- df >= minDf
+ (df >= minDf) && (df <= maxDf)
--- End diff --
Right. I just added them for clarity.
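The diff above keeps only vocabulary terms whose document frequency falls in `[minDf, maxDf]`. A minimal pure-Scala sketch of that predicate (no Spark; `DfFilterSketch`, `filterVocab`, and the sample counts are hypothetical stand-ins for the `reduceByKey` output shown in the diff):

```scala
object DfFilterSketch {
  // Same predicate as the diff: from (word, (termCount, docFreq)) pairs,
  // keep the words whose document frequency lies in [minDf, maxDf].
  def filterVocab(counts: Seq[(String, (Long, Long))],
                  minDf: Long, maxDf: Long): Seq[String] =
    counts.collect { case (word, (_, df)) if df >= minDf && df <= maxDf => word }

  def main(args: Array[String]): Unit = {
    // "a" appears in 4 documents, "b" in 3, "c" in 2, "d" in 1.
    val counts = Seq(("a", (4L, 4L)), ("b", (3L, 3L)), ("c", (2L, 2L)), ("d", (1L, 1L)))
    // minDf = 2 drops "d"; maxDf = 3 drops "a".
    println(filterVocab(counts, 2L, 3L).mkString(","))  // prints b,c
  }
}
```

In the real implementation the same filter runs distributed over an RDD, but the per-term logic is identical.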
---
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20366#discussion_r163353940
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -169,7 +197,7 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
}.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
(wc1 + wc2, df1 + df2)
}.filter { case (word, (wc, df)) =>
- df >= minDf
+ (df >= minDf) && (df <= maxDf)
--- End diff --
nit: the parentheses are not needed
---
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20366#discussion_r163353845
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -113,7 +132,7 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
/** @group getParam */
def getBinary: Boolean = $(binary)
- setDefault(vocabSize -> (1 << 18), minDF -> 1.0, minTF -> 1.0, binary -> false)
+ setDefault(vocabSize -> (1 << 18), minDF -> 1.0, minTF -> 1.0, binary -> false, maxDF -> Long.MaxValue)
--- End diff --
what about not setting a default value and applying the filter only if the parameter was set?
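The suggestion above can be sketched in plain Scala with an `Option` standing in for an unset ML `Param` (the names `OptionalMaxDfSketch` and `filterByDf` are hypothetical; Spark's actual mechanism would be `isSet(maxDF)` rather than `Option`):

```scala
object OptionalMaxDfSketch {
  // None models "maxDF was never set": the upper-bound filter is skipped
  // entirely instead of comparing against a sentinel default like Long.MaxValue.
  def filterByDf(counts: Seq[(String, Long)],
                 minDf: Long,
                 maxDf: Option[Long]): Seq[(String, Long)] = {
    val lowerBounded = counts.filter { case (_, df) => df >= minDf }
    maxDf match {
      case Some(max) => lowerBounded.filter { case (_, df) => df <= max }
      case None      => lowerBounded
    }
  }

  def main(args: Array[String]): Unit = {
    val counts = Seq(("a", 4L), ("b", 3L), ("c", 2L), ("d", 1L))
    println(filterByDf(counts, 2L, Some(3L)).map(_._1).mkString(","))  // prints b,c
    println(filterByDf(counts, 2L, None).map(_._1).mkString(","))      // prints a,b,c
  }
}
```

Either approach yields the same results; skipping the filter when unset just avoids one pass over the data and the magic default value.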
---
[GitHub] spark issue #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVectorize...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20366
Can one of the admins verify this patch?
---
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
Posted by ymazari <gi...@git.apache.org>.
Github user ymazari commented on a diff in the pull request:
https://github.com/apache/spark/pull/20366#discussion_r163355218
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---
@@ -119,6 +119,41 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
}
}
+ test("CountVectorizer maxDF") {
+ val df = Seq(
+ (0, split("a b c d"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+ (1, split("a b c"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0)))),
+ (2, split("a b"), Vectors.sparse(3, Seq((0, 1.0)))),
+ (3, split("a"), Vectors.sparse(3, Seq()))
+ ).toDF("id", "words", "expected")
+
+ // maxDF: ignore terms with count more than 3
+ val cvModel = new CountVectorizer()
+ .setInputCol("words")
+ .setOutputCol("features")
+ .setMaxDF(3)
+ .fit(df)
+ assert(cvModel.vocabulary === Array("b", "c", "d"))
+
+ cvModel.transform(df).select("features", "expected").collect().foreach {
+ case Row(features: Vector, expected: Vector) =>
+ assert(features ~== expected absTol 1e-14)
+ }
+
+ // maxDF: ignore terms with freq > 0.75
+ val cvModel2 = new CountVectorizer()
+ .setInputCol("words")
+ .setOutputCol("features")
+ .setMaxDF(0.75)
+ .fit(df)
+ assert(cvModel2.vocabulary === Array("b", "c", "d"))
+
+ cvModel2.transform(df).select("features", "expected").collect().foreach {
+ case Row(features: Vector, expected: Vector) =>
+ assert(features ~== expected absTol 1e-14)
+ }
--- End diff --
I will.
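For context on the quoted test's `setMaxDF(3)` and `setMaxDF(0.75)` cases producing the same vocabulary: like `minDF`, a `maxDF` value of 1.0 or greater is read as an absolute document count, while a value below 1.0 is read as a fraction of the number of documents. A sketch of that interpretation (hypothetical helper name, outside Spark):

```scala
object DfThresholdSketch {
  // Interpret a DF threshold parameter the way minDF does:
  // values >= 1.0 are absolute counts, values < 1.0 are fractions of numDocs.
  def effectiveThreshold(param: Double, numDocs: Long): Double =
    if (param >= 1.0) param else param * numDocs

  def main(args: Array[String]): Unit = {
    // The test data has 4 documents, so 0.75 * 4 = 3.0: both test cases
    // end up with the same effective threshold of 3 documents.
    println(effectiveThreshold(3.0, 4L))   // prints 3.0
    println(effectiveThreshold(0.75, 4L))  // prints 3.0
  }
}
```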
---
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20366#discussion_r163354161
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---
@@ -119,6 +119,41 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
}
}
+ test("CountVectorizer maxDF") {
+ val df = Seq(
+ (0, split("a b c d"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+ (1, split("a b c"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0)))),
+ (2, split("a b"), Vectors.sparse(3, Seq((0, 1.0)))),
+ (3, split("a"), Vectors.sparse(3, Seq()))
+ ).toDF("id", "words", "expected")
+
+ // maxDF: ignore terms with count more than 3
+ val cvModel = new CountVectorizer()
+ .setInputCol("words")
+ .setOutputCol("features")
+ .setMaxDF(3)
+ .fit(df)
+ assert(cvModel.vocabulary === Array("b", "c", "d"))
+
+ cvModel.transform(df).select("features", "expected").collect().foreach {
+ case Row(features: Vector, expected: Vector) =>
+ assert(features ~== expected absTol 1e-14)
+ }
+
+ // maxDF: ignore terms with freq > 0.75
+ val cvModel2 = new CountVectorizer()
+ .setInputCol("words")
+ .setOutputCol("features")
+ .setMaxDF(0.75)
+ .fit(df)
+ assert(cvModel2.vocabulary === Array("b", "c", "d"))
+
+ cvModel2.transform(df).select("features", "expected").collect().foreach {
+ case Row(features: Vector, expected: Vector) =>
+ assert(features ~== expected absTol 1e-14)
+ }
--- End diff --
could you please also add a UT to check that setting both `maxDF` and `minDF` works as expected?
---
[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
Posted by ymazari <gi...@git.apache.org>.
Github user ymazari closed the pull request at:
https://github.com/apache/spark/pull/20366
---