You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by facaiy <gi...@git.apache.org> on 2017/07/26 06:22:23 UTC

[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

GitHub user facaiy opened a pull request:

    https://github.com/apache/spark/pull/18736

    [SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF

    ## What changes were proposed in this pull request?
    
    Add indexOf method for ml.feature.HashingTF.
    
    The PR is a hotfix by storing hashingTF.
    I believe it is better to migrate native implement from mllib to ml. I can work on it, but I don't know whether to open an new JIRA for migrating or not.
    
    ## How was this patch tested?
    
    + [x] add a test case.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18736.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18736
    
----
commit f08861d64390f0eb4d7e34de13d78db67e046457
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-26T05:45:37Z

    hotfix: store hashingTF to reuse indexOf

commit 174a2d57ebeccba10b8f1435970cde69f5829c79
Author: Yan Facai (颜发才) <fa...@gmail.com>
Date:   2017-07-26T06:14:37Z

    TST: add test case

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    @facaiy Sure. Please open a JIRA to track it. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r133080543
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
     
       /** @group setParam */
       @Since("1.2.0")
    -  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +  def setNumFeatures(value: Int): this.type = {
    +    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
    --- End diff --
    
    I think `mllib` only need fix bug, no need add new method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r132058692
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       @Since("2.0.0")
       def setBinary(value: Boolean): this.type = set(binary, value)
     
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  @Since("2.3.0")
    +  def indexOf(term: Any): Int = {
    +    if (hashingTF != null) {
    +      hashingTF.indexOf(term)
    --- End diff --
    
    You can move the `feature.HashingTF` initialization into constructor.
    Otherwise the `indexOf` will throw error when `transform` not called, it is weird to user.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r133080201
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
    @@ -69,6 +69,20 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         assert(features ~== expected absTol 1e-14)
       }
     
    +  test("indexOf method") {
    +    val df = Seq((0, "a a b b c d".split(" ").toSeq)).toDF("id", "words")
    +    val n = 100
    +    val hashingTF = new HashingTF()
    +      .setInputCol("words")
    +      .setOutputCol("features")
    +      .setNumFeatures(n)
    +    hashingTF.transform(df)
    +    assert(hashingTF.indexOf("a") === 70)
    +    assert(hashingTF.indexOf("b") === 61)
    +    assert(hashingTF.indexOf("c") === 22)
    +    assert(hashingTF.indexOf("d") === 94)
    --- End diff --
    
    I prefer the testcase compare the result between `ml.feature.HashingTF` and `mllib.feature.HashingTF`, avoid hardcoding the computing result here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r132131171
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       @Since("2.0.0")
       def setBinary(value: Boolean): this.type = set(binary, value)
     
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  @Since("2.3.0")
    +  def indexOf(term: Any): Int = {
    +    if (hashingTF != null) {
    +      hashingTF.indexOf(term)
    --- End diff --
    
    Thanks for you suggestion, I'll like to modify the code later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r132253487
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       @Since("2.0.0")
       def setBinary(value: Boolean): this.type = set(binary, value)
     
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  @Since("2.3.0")
    +  def indexOf(term: Any): Int = {
    +    if (hashingTF != null) {
    +      hashingTF.indexOf(term)
    --- End diff --
    
    Another problem:
    
    When the parameter changed (numFeatures, binary), you should immediately create a new hashTF object  to replace the old one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    Closed as #18998 takes too long to wait.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r132618802
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
     
       /** @group setParam */
       @Since("1.2.0")
    -  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +  def setNumFeatures(value: Int): this.type = {
    +    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
    --- End diff --
    
    @WeichenXu123  How about adding `setNumFeatures` method to mllib? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    @yanboliang Hi, yangbo. Could you help review the PR? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    @facaiy I support to add ```indexOf``` to ```ml.feature.HashingTF```, but I think the way you fixed  is incorrect for PySpark. So I'd suggest to migrate the ```HashingTF``` implementation from mllib to ml firstly, and then add this function. Would you like to work on it? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy closed the pull request at:

    https://github.com/apache/spark/pull/18736


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by facaiy <gi...@git.apache.org>.
Github user facaiy commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    Sure, @yanboliang . Thanks for your suggestion. I'll work on it later, perhaps next week. Is it OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18736#discussion_r133127624
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala ---
    @@ -74,26 +74,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String)
     
       setDefault(numFeatures -> (1 << 18), binary -> false)
     
    +  private[this] var hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
    +
       /** @group getParam */
       @Since("1.2.0")
       def getNumFeatures: Int = $(numFeatures)
     
       /** @group setParam */
       @Since("1.2.0")
    -  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +  def setNumFeatures(value: Int): this.type = {
    +    val t = set(numFeatures, value)
    +    hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
    +    t
    +  }
     
       /** @group getParam */
       @Since("2.0.0")
       def getBinary: Boolean = $(binary)
     
       /** @group setParam */
       @Since("2.0.0")
    -  def setBinary(value: Boolean): this.type = set(binary, value)
    +  def setBinary(value: Boolean): this.type = {
    +    val t = set(binary, value)
    +    hashingTF.setBinary($(binary))
    --- End diff --
    
    I think we can't do other things except for set params in ```set***``` function, this is because PySpark ```set***``` doesn't call corresponding Scala function. In PySpark, we collect all params together and then pass them to Scala side, then call the following function:
    ```
    def fit(dataset: Dataset[_], paramMap: ParamMap): M = {
        copy(paramMap).fit(dataset)
      }
    ```
    which will skip the corresponding Scala ```set***``` functions. So it will make your change to old ```hashingTF``` doesn't take effect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18736: [SPARK-21481][ML] Add indexOf method for ml.feature.Hash...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18736
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org