You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2016/04/15 02:06:33 UTC
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
GitHub user holdenk opened a pull request:
https://github.com/apache/spark/pull/12404
[SPARK-14644][ML][PYSPARK] Turn Binary param into a shared param
## What changes were proposed in this pull request?
Change the binary param into a shared param.
## How was this patch tested?
Existing unit tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/holdenk/spark SPARK-14644-binary-param-shared-param
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12404.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12404
----
commit 7a0230a07d8650ded80b3f255047b7558ba8989c
Author: Holden Karau <ho...@us.ibm.com>
Date: 2016-04-14T23:22:57Z
update codegen
commit ccd28a47c9e96ac9df15a199ed85c42a2bb59787
Author: Holden Karau <ho...@us.ibm.com>
Date: 2016-04-14T23:37:54Z
Generate binary param
commit edbd652f85c9a1003ad23a76ef02e789987328c9
Author: Holden Karau <ho...@us.ibm.com>
Date: 2016-04-14T23:41:55Z
Use shared param
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:
https://github.com/apache/spark/pull/12404#discussion_r59826196
--- Diff: python/pyspark/ml/param/shared.py ---
@@ -583,6 +583,31 @@ def getVarianceCol(self):
return self.getOrDefault(self.varianceCol)
+class HasBinary(Params):
+ """
+ Mixin for param binary: If True, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.
--- End diff --
how come this didn't fail our style checker? did we break the pep8 checker?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210617504
So I thought the unified doc string of:
> If true, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.
might make sense in the two cases where we are using it (since `minTF` is described as a filter). If you think this isn't clear enough with the `minTF` case on `HashingTF` I'm happy to close it for now and revisit if we add `minTF` to `CountVectorizer`.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210225967
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55872/
Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210647651
Fair enough. We can revisit if more estimators use the param, or if we add
minTF to HashingTF.
On Fri, 15 Apr 2016 at 22:54, jkbradley <no...@github.com> wrote:
> I prefer to keep them separate. There isn't much benefit to combining 2
> instances, and the doc becomes a bit less clear.
>
> —
> You are receiving this because you commented.
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/12404#issuecomment-210642641>
>
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210225853
**[Test build #55872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55872/consoleFull)** for PR 12404 at commit [`edbd652`](https://github.com/apache/spark/commit/edbd652f85c9a1003ad23a76ef02e789987328c9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class CountVectorizer(JavaEstimator, HasBinary, HasInputCol, HasOutputCol, JavaMLReadable,`
* `class HashingTF(JavaTransformer, HasBinary, HasInputCol, HasOutputCol, HasNumFeatures,`
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk closed the pull request at:
https://github.com/apache/spark/pull/12404
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210641271
Sure that sounds like a clearer docstring :)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210662310
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55961/
Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210632010
ah - sorry I missed that. Ok, yeah this makes sense if we unify the doc string appropriately. Perhaps something like `If true, all non-zero counts (after any filters are applied) ...`?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210662305
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210225966
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/12404#discussion_r59931924
--- Diff: python/pyspark/ml/param/shared.py ---
@@ -583,6 +583,31 @@ def getVarianceCol(self):
return self.getOrDefault(self.varianceCol)
+class HasBinary(Params):
+ """
+ Mixin for param binary: If True, all non zero results are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Default False.
--- End diff --
From toxi.ini we've excluded `cloudpickle.py`,`heapq3.py`, and `shared.py`, we've also done similar exclusions on the Scala side for the auto generated params.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210649318
**[Test build #55961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55961/consoleFull)** for PR 12404 at commit [`b702e72`](https://github.com/apache/spark/commit/b702e7278c840e4d917ea321e96fe28dcef8dee6).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210649760
Sounds good if people prefer to keep them separate I'll close this out.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210212321
**[Test build #55872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55872/consoleFull)** for PR 12404 at commit [`edbd652`](https://github.com/apache/spark/commit/edbd652f85c9a1003ad23a76ef02e789987328c9).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210307663
Hey @holdenk, in https://github.com/apache/spark/pull/12079#issuecomment-209049583 and https://github.com/apache/spark/pull/12308#issuecomment-209039855, we decided to keep them separate for now, since the doc for `binary` in `CountVectorizer` mentions that the binarization occurs after `minTF` is applied. In `HashingTF` this doesn't apply as there is no `minTF` parameter.
So unless we can easily have different doc string, I think it might be best to not separate out `binary` into a shared param yet. We could look at adding `minTF` semantics to `hashingTF`, in which case it would then make a lot of sense to have this as a shared param.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210662151
**[Test build #55961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55961/consoleFull)** for PR 12404 at commit [`b702e72`](https://github.com/apache/spark/commit/b702e7278c840e4d917ea321e96fe28dcef8dee6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-14644][ML][PYSPARK] Turn Binary param i...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/12404#issuecomment-210642641
I prefer to keep them separate. There isn't much benefit to combining 2 instances, and the doc becomes a bit less clear.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org