You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2016/04/15 02:06:33 UTC

[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/12404

    [SPARK-14644][ML][PYSPARK] Turn Binary param into a shared param

    ## What changes were proposed in this pull request?
    
    Change the binary param into a shared param.
    
    ## How was this patch tested?
    
    Existing unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-14644-binary-param-shared-param

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12404
    
----
commit 7a0230a07d8650ded80b3f255047b7558ba8989c
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-04-14T23:22:57Z

    update codegen

commit ccd28a47c9e96ac9df15a199ed85c42a2bb59787
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-04-14T23:37:54Z

    Generate binary param

commit edbd652f85c9a1003ad23a76ef02e789987328c9
Author: Holden Karau <ho...@us.ibm.com>
Date:   2016-04-14T23:41:55Z

    Use shared param

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12404#discussion_r59826196
  
    --- Diff: python/pyspark/ml/param/shared.py ---
    @@ -583,6 +583,31 @@ def getVarianceCol(self):
             return self.getOrDefault(self.varianceCol)
     
     
    +class HasBinary(Params):
    +    """
    +    Mixin for param binary: If True, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.
    --- End diff --
    
    how come this didn't fail our style checker? did we break the pep8 checker?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210617504
  
    So I thought the unified doc string of:
    
    > If true, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.  Default False.
    
    might make sense in the two cases where we are using it (since `minTF` is described as a filter). If you think this isn't clear enough with the `minTF` case on `HashingTF` I'm happy to close it for now and revisit if we add `minTF` to `CountVectorizer`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210225967
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55872/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210647651
  
    Fair enough. We can revisit if more estimators use the param, or if we add
    minTF to HashingTF.
    On Fri, 15 Apr 2016 at 22:54, jkbradley <no...@github.com> wrote:
    
    > I prefer to keep them separate. There isn't much benefit to combining 2
    > instances, and the doc becomes a bit less clear.
    >
    > —
    > You are receiving this because you commented.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12404#issuecomment-210642641>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210225853
  
    **[Test build #55872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55872/consoleFull)** for PR 12404 at commit [`edbd652`](https://github.com/apache/spark/commit/edbd652f85c9a1003ad23a76ef02e789987328c9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class CountVectorizer(JavaEstimator, HasBinary, HasInputCol, HasOutputCol, JavaMLReadable,`
      * `class HashingTF(JavaTransformer, HasBinary, HasInputCol, HasOutputCol, HasNumFeatures,`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk closed the pull request at:

    https://github.com/apache/spark/pull/12404


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210641271
  
    Sure that sounds like a clearer docstring :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210662310
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55961/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210632010
  
    ah - sorry I missed that. Ok, yeah this makes sense if we unify the doc string appropriately. Perhaps something like `If true, all non-zero counts (after any filters are applied) ...`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210662305
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210225966
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12404#discussion_r59931924
  
    --- Diff: python/pyspark/ml/param/shared.py ---
    @@ -583,6 +583,31 @@ def getVarianceCol(self):
             return self.getOrDefault(self.varianceCol)
     
     
    +class HasBinary(Params):
    +    """
    +    Mixin for param binary: If True, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.
    --- End diff --
    
    From toxi.ini we've excluded `cloudpickle.py`,`heapq3.py`, and `shared.py`, we've also done similar exclusions on the Scala side for the auto generated params.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210649318
  
    **[Test build #55961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55961/consoleFull)** for PR 12404 at commit [`b702e72`](https://github.com/apache/spark/commit/b702e7278c840e4d917ea321e96fe28dcef8dee6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210649760
  
    Sounds good if people prefer to keep them separate I'll close this out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210212321
  
    **[Test build #55872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55872/consoleFull)** for PR 12404 at commit [`edbd652`](https://github.com/apache/spark/commit/edbd652f85c9a1003ad23a76ef02e789987328c9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210307663
  
    Hey @holdenk, in https://github.com/apache/spark/pull/12079#issuecomment-209049583 and https://github.com/apache/spark/pull/12308#issuecomment-209039855, we decided to keep them separate for now, since the doc for `binary` in `CountVectorizer` mentions that the binarization occurs after `minTF` is applied. In `HashingTF` this doesn't apply as there is no `minTF` parameter.
    
    So unless we can easily have different doc string, I think it might be best to not separate out `binary` into a shared param yet. We could look at adding `minTF` semantics to `hashingTF`, in which case it would then make a lot of sense to have this as a shared param.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210662151
  
    **[Test build #55961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55961/consoleFull)** for PR 12404 at commit [`b702e72`](https://github.com/apache/spark/commit/b702e7278c840e4d917ea321e96fe28dcef8dee6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12404#issuecomment-210642641
  
    I prefer to keep them separate.  There isn't much benefit to combining 2 instances, and the doc becomes a bit less clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org