You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by BryanCutler <gi...@git.apache.org> on 2016/03/18 22:47:02 UTC

[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

GitHub user BryanCutler opened a pull request:

    https://github.com/apache/spark/pull/11832

    [SPARK-13963][ML] Adding binary toggle param to HashingTF

    ## What changes were proposed in this pull request?
    Adding binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF since the former wraps this functionality.  This parameter, if true, will set non-zero valued term counts to 1 to transform term count features to binary values that are well suited for discrete probability models.
    
    ## How was this patch tested?
    Added unit tests for ML and MLlib
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark binary-param-HashingTF-SPARK-13963

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11832
    
----
commit a5ff3309c0d07e57177374133130803eb98ebffb
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-03-18T21:19:19Z

    [SPARK-13963] Adding binary toggle to HashingTF in ml/mllib

commit 31097231769860b86d1d3234ebf7d4e95f96e5cb
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-03-18T21:19:48Z

    Added unit test for HashingTF binary toggle

commit ca1436166a1292f92d72408c10cf606623b31bbd
Author: Bryan Cutler <cu...@gmail.com>
Date:   2016-03-18T21:26:34Z

    fixed param description text

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56948484
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    --- End diff --
    
    this is not required here as it's tested above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-199975647
  
    **[Test build #53798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53798/consoleFull)** for PR 11832 at commit [`d6431eb`](https://github.com/apache/spark/commit/d6431ebc92b77e7fb1b21309eca0982c3ffd424c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56947750
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -53,9 +65,10 @@ class HashingTF(val numFeatures: Int) extends Serializable {
       @Since("1.1.0")
       def transform(document: Iterable[_]): Vector = {
         val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
    --- End diff --
    
    Yup, the if-else is not actually called within the loop for each term. I think this is ok to keep as is, I don't see any better approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r57041720
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    --- End diff --
    
    No, I was just talking about in this test suite, but nevermind, it wasn't very clean.  I guess if the index calculation ends up changing, these test will fail anyway.
    I was trying to avoid moving `def idx` to the class because then it would need to be called with a number to mod always, like `idx("a", 100)`, so I'll try something else that will hopefully be a little better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198560033
  
    **[Test build #53570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53570/consoleFull)** for PR 11832 at commit [`ca14361`](https://github.com/apache/spark/commit/ca1436166a1292f92d72408c10cf606623b31bbd).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r57032672
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    --- End diff --
    
    Do you mean expose "indexOf" in ml to match mllib?
    On Tue, 22 Mar 2016 at 19:27, Bryan Cutler <no...@github.com> wrote:
    
    > In mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala
    > <https://github.com/apache/spark/pull/11832#discussion_r57030239>:
    >
    > > +  test("applying binary term freqs") {
    > > +    val df = sqlContext.createDataFrame(Seq(
    > > +      (0, "a a b c c c".split(" ").toSeq)
    > > +    )).toDF("id", "words")
    > > +    val n = 100
    > > +    val hashingTF = new HashingTF()
    > > +        .setInputCol("words")
    > > +        .setOutputCol("features")
    > > +        .setNumFeatures(n)
    > > +        .setBinary(true)
    > > +    val output = hashingTF.transform(df)
    > > +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    > > +    require(attrGroup.numAttributes === Some(n))
    > > +    val features = output.select("features").first().getAs[Vector](0)
    > > +    // Assume perfect hash on "a", "b", "c".
    > > +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    >
    > It actually kind of bugged me that these tests are defining their own
    > index calculations, even though it is just a mod. Since ml.HashingTF
    > doesn't have this available, I'll try out a wrapping the mllib function
    > being used, which I think is more correct.
    >
    > —
    > You are receiving this because you are subscribed to this thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11832/files/19849fe49d1f0be62c9bf54f1312d8cfc457f003#r57030239>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56948423
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    +    val expected = Vectors.sparse(n,
    +      Seq((idx("a"), 1.0), (idx("b"), 1.0), (idx("c"), 1.0)))
    +    assert(features ~== expected absTol 1e-14)
    +  }
    --- End diff --
    
    I think the idea is that you could test both normal mode and binary mode in the same test - though I personally like separate test cases as it is more obvious what went wrong when it fails. You can pare down this test though (see my other comments).
    
    I think the MLlib test is fine as is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-202825496
  
    merged to master, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56787196
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -53,9 +65,10 @@ class HashingTF(val numFeatures: Int) extends Serializable {
       @Since("1.1.0")
       def transform(document: Iterable[_]): Vector = {
         val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
    --- End diff --
    
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56948444
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    --- End diff --
    
    this `def` could be moved out and used in both tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-202823446
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198560044
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53570/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198560043
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56779584
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -53,9 +65,10 @@ class HashingTF(val numFeatures: Int) extends Serializable {
       @Since("1.1.0")
       def transform(document: Iterable[_]): Vector = {
         val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
    --- End diff --
    
    Could you please move this if-then-else outside of the loop?  (I'm guessing it would be worthwhile for speed.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198587059
  
    **[Test build #53588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53588/consoleFull)** for PR 11832 at commit [`19849fe`](https://github.com/apache/spark/commit/19849fe49d1f0be62c9bf54f1312d8cfc457f003).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-200922079
  
    LGTM - I'm traveling and don't have decent web access, so @jkbradley could
    you merge?
    On Thu, 24 Mar 2016 at 18:36, Bryan Cutler <no...@github.com> wrote:
    
    > I think it is good to go unless @MLnick <https://github.com/MLnick> has
    > any more comments.
    >
    > —
    > You are receiving this because you were mentioned.
    >
    >
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11832#issuecomment-200915659>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-202823481
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54428/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-199976351
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53798/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-200915659
  
    I think it is good to go unless @MLnick has any more comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-202807584
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198594462
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53588/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56896101
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -53,9 +65,10 @@ class HashingTF(val numFeatures: Int) extends Serializable {
       @Since("1.1.0")
       def transform(document: Iterable[_]): Vector = {
         val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
    --- End diff --
    
    Are you guys talking about the `document.foreach` loop?  The if-then-else is just to assign a function, so it would just be called once per document, then the function is called in the loop.  If this line is a bit too much, I could rewrite it to something more straightforward.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-202807894
  
    **[Test build #54428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54428/consoleFull)** for PR 11832 at commit [`d6431eb`](https://github.com/apache/spark/commit/d6431ebc92b77e7fb1b21309eca0982c3ffd424c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56787382
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    +    val expected = Vectors.sparse(n,
    +      Seq((idx("a"), 1.0), (idx("b"), 1.0), (idx("c"), 1.0)))
    +    assert(features ~== expected absTol 1e-14)
    +  }
    --- End diff --
    
    Saw some duplication with last ut, perhaps consider merge them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-200659606
  
    LGTM Ready to merge?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-199094605
  
    That's all!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198924173
  
    LGTM apart from the small doc comment. Will leave open for a while for any other comments (cc @jkbradley)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56763755
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -36,12 +36,24 @@ import org.apache.spark.util.Utils
     @Since("1.1.0")
     class HashingTF(val numFeatures: Int) extends Serializable {
     
    +  private var binary = false
    +
       /**
        */
       @Since("1.1.0")
       def this() = this(1 << 20)
     
       /**
    +   * Sets the binary toggle param. If true, term frequency vector will be binary such that non-zero
    --- End diff --
    
    Don't think it's necessary to say "Sets the binary toggle param."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198594461
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r57030239
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    --- End diff --
    
    It actually kind of bugged me that these tests are defining their own index calculations, even though it is just a mod.  Since ml.HashingTF doesn't have this available, I'll try out a wrapping the mllib function being used, which I think is more correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r57273260
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -53,9 +65,10 @@ class HashingTF(val numFeatures: Int) extends Serializable {
       @Since("1.1.0")
       def transform(document: Iterable[_]): Vector = {
         val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
    --- End diff --
    
    Oh yep you're right


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11832


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198556902
  
    **[Test build #53570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53570/consoleFull)** for PR 11832 at commit [`ca14361`](https://github.com/apache/spark/commit/ca1436166a1292f92d72408c10cf606623b31bbd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-198594419
  
    **[Test build #53588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53588/consoleFull)** for PR 11832 at commit [`19849fe`](https://github.com/apache/spark/commit/19849fe49d1f0be62c9bf54f1312d8cfc457f003).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HashingTF(val numFeatures: Int) extends Serializable `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-199953336
  
    **[Test build #53798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53798/consoleFull)** for PR 11832 at commit [`d6431eb`](https://github.com/apache/spark/commit/d6431ebc92b77e7fb1b21309eca0982c3ffd424c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-202823191
  
    **[Test build #54428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54428/consoleFull)** for PR 11832 at commit [`d6431eb`](https://github.com/apache/spark/commit/d6431ebc92b77e7fb1b21309eca0982c3ffd424c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11832#discussion_r56897863
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -52,6 +52,27 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("applying binary term freqs") {
    +    val df = sqlContext.createDataFrame(Seq(
    +      (0, "a a b c c c".split(" ").toSeq)
    +    )).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +        .setInputCol("words")
    +        .setOutputCol("features")
    +        .setNumFeatures(n)
    +        .setBinary(true)
    +    val output = hashingTF.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    +    val features = output.select("features").first().getAs[Vector](0)
    +    // Assume perfect hash on "a", "b", "c".
    +    def idx(any: Any): Int = Utils.nonNegativeMod(any.##, n)
    +    val expected = Vectors.sparse(n,
    +      Seq((idx("a"), 1.0), (idx("b"), 1.0), (idx("c"), 1.0)))
    +    assert(features ~== expected absTol 1e-14)
    +  }
    --- End diff --
    
    Yeah, maybe it is a little too redundant.  I could remove the MLlib version or change it to use an RDD to which would test a slightly different code path?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-199976346
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13963][ML] Adding binary toggle param t...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11832#issuecomment-203126771
  
    Thanks for merging--just saw notification!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org