Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2015/12/02 02:22:32 UTC
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
GitHub user holdenk opened a pull request:
https://github.com/apache/spark/pull/10085
[SPARK-11937][SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer
Add Python API for ml.feature.QuantileDiscretizer.
One open question: do we want to re-use the Java model the way this PR does, create a new model, or use a different wrapper around the Java model?
cc @brkyvz & @mengxr
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/holdenk/spark SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10085.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10085
----
commit dbabade554b93bee191b56115a516f39d4e128ea
Author: Holden Karau <ho...@us.ibm.com>
Date: 2015-12-01T22:03:01Z
Start working towards implementing the Python interface for QuantileDiscretizer. One question (for review): do we want to change the Bucketizer as I've done, or create a different wrapper? I think this way is better, but it does introduce an extra param, so I'm not sure.
commit 1cacd7667ac0fa37b94fcb842e1d1616898279e9
Author: Holden Karau <ho...@us.ibm.com>
Date: 2015-12-02T01:11:51Z
OK, remove _java_model before setting the params since it isn't really a param; print out the splits from the trained bucketizer
commit cfb255fc903f8283ef3fc55cf52e0fed8634f9bb
Author: Holden Karau <ho...@us.ibm.com>
Date: 2015-12-02T01:19:06Z
And make sure the generated model works
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-170710756
**[Test build #49177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49177/consoleFull)** for PR 10085 at commit [`798798c`](https://github.com/apache/spark/commit/798798c49eaa9b6b62c0266d343b2350edac5875).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172938977
Those are the only issues I see. Thanks everyone for reviewing & @holdenk for the PR!
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172963641
**[Test build #49694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49694/consoleFull)** for PR 10085 at commit [`f21ebef`](https://github.com/apache/spark/commit/f21ebefc1e0edc16c2eed8e5474033e8d3baf1ae).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-170706592
**[Test build #49177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49177/consoleFull)** for PR 10085 at commit [`798798c`](https://github.com/apache/spark/commit/798798c49eaa9b6b62c0266d343b2350edac5875).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-163392210
OK, just to make sure: do you see any issues with the current approach for getSplits? It's tested a bit in the doctests, but if there is a potential issue I can add some more tests.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173022443
Merged build finished. Test FAILed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174732868
**[Test build #50037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50037/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172238836
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49534/
Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50175399
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
--- End diff --
Sure :)
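For readers following the quoted docstring, here is a minimal pure-Python sketch of the idea it describes: pick split points from sorted sample values so the buckets are roughly equal in size, then bracket them with -inf and +inf. This is not Spark's implementation; `quantile_splits` and `bucket_index` are hypothetical names, and the split values Spark actually produces for the same data may differ.

```python
import bisect
import math


def quantile_splits(sample, num_buckets):
    """Return split points for roughly equal-sized buckets (illustrative only)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    values = sorted(sample)
    if not values:
        return [-math.inf, math.inf]
    inner = []
    for i in range(1, num_buckets):
        # Candidate split: the sample value at the i-th quantile boundary.
        candidate = values[min(len(values) - 1, i * len(values) // num_buckets)]
        # Repeated candidates collapse, so fewer buckets than requested may result.
        if not inner or candidate > inner[-1]:
            inner.append(candidate)
    return [-math.inf] + inner + [math.inf]


def bucket_index(splits, value):
    """Index of the half-open bucket [splits[i], splits[i + 1]) holding value."""
    last_bucket = len(splits) - 2
    return min(last_bucket, bisect.bisect_right(splits, value) - 1)


splits = quantile_splits([0.1, 0.4, 1.2, 1.5], 2)
print(splits)
print([bucket_index(splits, v) for v in [0.1, 0.4, 1.2, 1.5]])
```

Calling `quantile_splits([1, 1, 1, 1], 4)` in this sketch yields only two buckets, which mirrors the docstring's remark that fewer than numBuckets partitions may be found.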
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174677492
**[Test build #50015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161166678
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47031/
Test PASSed.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161163116
**[Test build #47031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47031/consoleFull)** for PR 10085 at commit [`2afd197`](https://github.com/apache/spark/commit/2afd197cf74ca9552333ddd7a13bbfe8bd35490c).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172238655
**[Test build #49534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49534/consoleFull)** for PR 10085 at commit [`5e18778`](https://github.com/apache/spark/commit/5e18778d04266b0fd63ec70b871404bde83b0c58).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50194199
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
--- End diff --
Tried that; it seems to get printed differently, so I'll go back to the int one instead (or just drop it).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173644319
I think I've addressed all of @jkbradley's comments.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172238834
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-163446677
@yinxusen thanks :)
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-163442465
@holdenk No more issues with getSplits. It looks good.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161152153
**[Test build #47026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47026/consoleFull)** for PR 10085 at commit [`2540101`](https://github.com/apache/spark/commit/254010184eeee33a4a9f8aeda7b77bfd365b18f3).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174736775
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-171904059
@jkbradley LGTM except for the version labels.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174857126
LGTM
Merging with master
Thanks for the PR!
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161304048
cc @yanboliang who filed the JIRA for this.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-171473259
re-ping @jkbradley ?
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174728577
**[Test build #50015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-163165252
I vote for making `Bucketizer` a `Model` rather than a `Transformer`, which is consistent with the Scala code. @yinxusen Could you explain why you gave up on this approach?
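To make the Model-versus-Transformer question concrete, here is a small stand-alone sketch. All classes below are hypothetical toys, not PySpark's own `Transformer`, `Model`, or `Bucketizer`; the only thing taken from the PR is the shape of the API, where `fit()` on the discretizer returns a bucketizer that exposes `getSplits()`.

```python
import bisect
import math


class Transformer:
    """Anything that can map input values to output values."""

    def transform(self, values):
        raise NotImplementedError


class Model(Transformer):
    """A Transformer that remembers which estimator produced it."""

    def __init__(self, parent=None):
        self.parent = parent


class ToyBucketizer(Transformer):
    """Usable with hand-written splits, so it is a plain Transformer in this toy."""

    def __init__(self, splits):
        self.splits = list(splits)

    def getSplits(self):
        return self.splits

    def transform(self, values):
        last_bucket = len(self.splits) - 2
        return [min(last_bucket, bisect.bisect_right(self.splits, v) - 1)
                for v in values]


class ToyMedianDiscretizer:
    """Toy estimator: fit() learns a single split and returns a ToyBucketizer."""

    def fit(self, values):
        ordered = sorted(values)
        middle = ordered[len(ordered) // 2]
        return ToyBucketizer([-math.inf, middle, math.inf])


fitted = ToyMedianDiscretizer().fit([0.1, 0.4, 1.2, 1.5])
print(isinstance(fitted, Transformer), isinstance(fitted, Model))
print(fitted.getSplits(), fitted.transform([0.2, 1.3]))
```

In this toy the fitted object is a `Transformer` but not a `Model`. Changing `ToyBucketizer` to inherit from `Model` would be the alternative being voted for here; the sketch takes no position on which one the PR should adopt.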
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-162658080
Can you please only link to the specific JIRA, not the umbrella?
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161155114
Merged build finished. Test FAILed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173030256
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49726/
Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50160074
--- Diff: python/pyspark/ml/feature.py ---
@@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol):
"specified will be treated as errors.")
@keyword_only
- def __init__(self, splits=None, inputCol=None, outputCol=None):
+ def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None):
--- End diff --
Oh yeah, I think the original plan was to avoid the overhead of creating a new object and sending the params back to the JVM when a Java model is supplied, since we already had a transformer. I'll remove this.
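The trade-off described in this comment can be sketched without Spark. In the toy below, `FakeJvmObject` is a hypothetical stand-in for a JVM-side object (it only counts how many instances get created), and `ToyBucketizer` is not the real PySpark class; the `_java_model` keyword simply mirrors the parameter shown in the quoted diff.

```python
class FakeJvmObject:
    """Stand-in for a JVM-side object; counts how many were created."""

    created = 0

    def __init__(self):
        FakeJvmObject.created += 1
        self.params = {}


class ToyBucketizer:
    def __init__(self, splits=None, _java_model=None):
        if _java_model is not None:
            # Reuse the object that already exists; no params are re-sent.
            self._java_obj = _java_model
        else:
            # Create a fresh backing object and push the params to it.
            self._java_obj = FakeJvmObject()
            self._java_obj.params["splits"] = splits

    def getSplits(self):
        return self._java_obj.params["splits"]


# Pretend this object came back from fitting on the JVM side.
fitted_backing = FakeJvmObject()
fitted_backing.params["splits"] = [float("-inf"), 0.4, float("inf")]

reused = ToyBucketizer(_java_model=fitted_backing)
rebuilt = ToyBucketizer(splits=fitted_backing.params["splits"])
print(FakeJvmObject.created)
print(reused.getSplits() == rebuilt.getSplits())
```

Both wrappers report the same splits, but the reuse path skips one object creation. The cost is an extra keyword in the public constructor, which is the concern raised in the first commit message of this PR.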
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50160361
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
--- End diff --
It's a float, so to keep the doctest from being flaky while still human readable, I truncated the split.
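To illustrate the truncation described above, here is a minimal sketch in plain Python. The helper name `truncate_split` and the sample values are hypothetical and only stand in for whatever split the fitted bucketizer returns; this is not code from the PR.

```python
def truncate_split(split, scale=10):
    """Truncate a float split to a whole number of 1/scale units."""
    return int(split * scale)

# Hypothetical split values that differ only by small float noise.
exact = 0.4
noisy = 0.4000000001

# Both truncate to the same integer, so a doctest comparing the
# truncated value does not depend on the trailing digits.
print(truncate_split(exact))
print(truncate_split(noisy))
```

One caveat with this approach: truncation rounds toward zero, so a split that lands slightly below the expected value (for example 0.39999999) would truncate to 3 rather than 4.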
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50150751
--- Diff: python/pyspark/ml/feature.py ---
@@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol):
"specified will be treated as errors.")
@keyword_only
- def __init__(self, splits=None, inputCol=None, outputCol=None):
+ def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None):
--- End diff --
Why is _java_model needed? It does not seem to be used.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174728776
Merged build finished. Test FAILed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-164526253
cc @yanboliang if you have a chance to take a look
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-162748542
Hi @holdenk, I think this PR duplicates mine: https://github.com/apache/spark/pull/10007
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/10085
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r49830062
--- Diff: python/pyspark/ml/feature.py ---
@@ -972,6 +973,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
+ 4
+ >>> bucketed = bucketizer.transform(df).collect()
+ >>> bucketed[0].buckets
+ 0.0
+
+ .. versionadded:: 1.6.0
--- End diff --
Change it to 2.0.0.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174728777
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50015/
Test FAILed.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161166603
**[Test build #47031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47031/consoleFull)** for PR 10085 at commit [`2afd197`](https://github.com/apache/spark/commit/2afd197cf74ca9552333ddd7a13bbfe8bd35490c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):`
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-163172333
@yanboliang I am OK with changing `Bucketizer` to a `JavaModel`. At the time I just did not want to change that piece of code. That's also why I closed my PR: I think @holdenk's implementation is better. But as I said, be careful with `getSplits`. :)
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173022445
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49719/
Test FAILed.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161155116
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47026/
Test FAILed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-170710926
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49177/
Test PASSed.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161161568
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47030/
Test FAILed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-170710921
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-168056674
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173022402
**[Test build #49719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49719/consoleFull)** for PR 10085 at commit [`f9e3086`](https://github.com/apache/spark/commit/f9e3086b2fa7eae24f22aa2fd32eb644d829e52b).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174730604
seems unrelated, jenkins retest this please.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-168056577
**[Test build #48495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48495/consoleFull)** for PR 10085 at commit [`601a9ea`](https://github.com/apache/spark/commit/601a9eaa35d164cd94c48dd8d94a64931f10c6ea).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172963904
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49694/
Test PASSed.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-162749686
OK, I did not realize that there is an umbrella JIRA for this. I'll close mine.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172963900
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173019235
**[Test build #49719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49719/consoleFull)** for PR 10085 at commit [`f9e3086`](https://github.com/apache/spark/commit/f9e3086b2fa7eae24f22aa2fd32eb644d829e52b).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173030254
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50150766
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
+ 4
+ >>> bucketed = bucketizer.transform(df).collect()
+ >>> bucketed[0].buckets
+ 0.0
+
+ .. versionadded:: 2.0.0
+ """
+
+ # a placeholder to make it appear in the generated doc
+ numBuckets = Param(Params._dummy(), "numBuckets",
+ "Maximum number of buckets (quantiles, or " +
+ "categories) into which data points are grouped. Must be >= 2. Default 2.")
+
+ @keyword_only
+ def __init__(self, numBuckets=2, inputCol=None, outputCol=None):
+ """
+ __init__(self, numBuckets=2, inputCol=None, outputCol=None)
+ """
+ super(QuantileDiscretizer, self).__init__()
+ self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.QuantileDiscretizer",
+ self.uid)
+ self.numBuckets = Param(self, "numBuckets",
+ "Maximum number of buckets (quantiles, or " +
+ "categories) into which data points are grouped. Must be >= 2.")
+ self._setDefault(numBuckets=2)
+ kwargs = self.__init__._input_kwargs
+ self.setParams(**kwargs)
+
+ @keyword_only
+ @since("2.0.0")
+ def setParams(self, numBuckets=2, inputCol=None, outputCol=None):
+ """
+ setParams(self, numBuckets=2, inputCol=None, outputCol=None)
+ Set the params for the QuantileDiscertizerBase
--- End diff --
Typo: "Discertizer" should be "Discretizer".
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-168056675
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48495/
Test PASSed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-171492302
I'll try to check this soon, but I have some others to get through first. It would be great if someone else could review this PR in the meantime. @yinxusen Would you have time? Thanks!
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161155079
**[Test build #47026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47026/consoleFull)** for PR 10085 at commit [`2540101`](https://github.com/apache/spark/commit/254010184eeee33a4a9f8aeda7b77bfd365b18f3).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):`
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r49829992
--- Diff: python/pyspark/ml/feature.py ---
@@ -972,6 +973,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
+ 4
+ >>> bucketed = bucketizer.transform(df).collect()
+ >>> bucketed[0].buckets
+ 0.0
+
+ .. versionadded:: 1.6.0
+ """
+
+ # a placeholder to make it appear in the generated doc
+ numBuckets = Param(Params._dummy(), "numBuckets",
+ "Maximum number of buckets (quantiles, or " +
+ "categories) into which data points are grouped. Must be >= 2.")
--- End diff --
Should we add a `default 2` here?
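For readers following the `numBuckets` discussion, here is a rough sketch in plain Python of what the quoted docstring describes: picking split points from a sample so the bins are roughly equal in size, with -Infinity and +Infinity as the outer bounds. This is an illustration under stated assumptions, not Spark's implementation. The function name `quantile_splits`, the `num_buckets` argument, and the index rule used to pick candidates are all hypothetical.

```python
import math

def quantile_splits(sample, num_buckets=2):
    """Return split points for at most num_buckets bins over a sample."""
    if num_buckets < 2:
        raise ValueError("num_buckets must be >= 2")
    if not sample:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample)
    n = len(ordered)
    inner = []
    for k in range(1, num_buckets):
        # Index of the value at the k-th quantile boundary (assumed rule).
        idx = max(math.ceil(k * n / num_buckets) - 1, 0)
        candidate = ordered[idx]
        # Skip repeated values so the splits stay strictly increasing;
        # this is why fewer than num_buckets bins may come back.
        if not inner or candidate > inner[-1]:
            inner.append(candidate)
    # The outer bounds cover all real values.
    return [-math.inf] + inner + [math.inf]

# Same data as the doctest quoted above, with the default of 2 buckets.
splits = quantile_splits([0.1, 0.4, 1.2, 1.5])
print(splits)
```

With a sample of identical values, such as `quantile_splits([1.0, 1.0, 1.0, 1.0], num_buckets=4)`, the sketch returns only one inner split, which matches the docstring's note that fewer buckets may be found depending on the sample values.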
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161161566
Merged build finished. Test FAILed.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172237877
**[Test build #49534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49534/consoleFull)** for PR 10085 at commit [`5e18778`](https://github.com/apache/spark/commit/5e18778d04266b0fd63ec70b871404bde83b0c58).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173030147
**[Test build #49726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49726/consoleFull)** for PR 10085 at commit [`194ec6d`](https://github.com/apache/spark/commit/194ec6daa6da519490c33af4aa431f91cc7df88d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172235563
@yinxusen /@jkbradley updated the version added tag to 2.0.0 :)
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50175081
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
--- End diff --
How about using round instead?
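
A small plain-Python illustration of why this matters, assuming the split could come back with a tiny floating-point error on either side of 0.4. The candidate values and helper names are made up for the example; they are not output from Spark.

```python
def truncated_tenths(split):
    # int() truncates toward zero, so a value just below 0.4 gives 3.
    return int(split * 10)


def rounded_split(split):
    # round() to two decimals absorbs small errors in either direction.
    return round(split, 2)


# Hypothetical split values around 0.4.
for split in (0.4, 0.4000000001, 0.3999999999):
    print(split, truncated_tenths(split), rounded_split(split))
```

The truncating form changes its answer for the last value, while the rounded form stays at 0.4 for all three.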
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172429593
LGTM as well! Thanks.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-168054538
**[Test build #48495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48495/consoleFull)** for PR 10085 at commit [`601a9ea`](https://github.com/apache/spark/commit/601a9eaa35d164cd94c48dd8d94a64931f10c6ea).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-173027414
**[Test build #49726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49726/consoleFull)** for PR 10085 at commit [`194ec6d`](https://github.com/apache/spark/commit/194ec6daa6da519490c33af4aa431f91cc7df88d).
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161159457
**[Test build #47030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47030/consoleFull)** for PR 10085 at commit [`1145ec4`](https://github.com/apache/spark/commit/1145ec420590fc2e2cfc554433d9ba9ebabeb821).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174736779
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50037/
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-162761248
@holdenk For your questions: I first tried modifying the interface of `Bucketizer`, making it a `JavaModel` rather than a `JavaTransformer`. In the end I decided not to touch `Bucketizer` and instead added an inner class, [`QuantileDiscretizerModel`](https://github.com/yinxusen/spark/blob/3a33327122ae94d59403d807255273180528d9a9/python/pyspark/ml/feature.py#L2149), to get the splits.
I do recommend testing `getSplits` on the `Bucketizer` generated by the `QuantileDiscretizer`, though: I got a serialization error there and added a [`getJavaSplits`](https://github.com/yinxusen/spark/blob/3a33327122ae94d59403d807255273180528d9a9/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L64) to avoid it. JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-7379).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50150754
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
--- End diff --
This is odd. Can you not just check ```splits[1]```?
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-172959696
**[Test build #49694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49694/consoleFull)** for PR 10085 at commit [`f21ebef`](https://github.com/apache/spark/commit/f21ebefc1e0edc16c2eed8e5474033e8d3baf1ae).
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-171518197
@jkbradley I'll help you review this.
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/10085#discussion_r50150757
--- Diff: python/pyspark/ml/feature.py ---
@@ -992,6 +993,88 @@ def getDegree(self):
@inherit_doc
+class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
+ """
+ .. note:: Experimental
+
+ `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
+ categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
+ into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ covering all real values. This attempts to find numBuckets partitions based on a sample of data,
+ but it may find fewer depending on the data sample values.
+
+ >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
+ >>> qds = QuantileDiscretizer(numBuckets=2,
+ ... inputCol="values", outputCol="buckets")
+ >>> bucketizer = qds.fit(df)
+ >>> splits = bucketizer.getSplits()
+ >>> splits[0]
+ -inf
+ >>> int(splits[1]*10)
+ 4
+ >>> bucketed = bucketizer.transform(df).collect()
--- End diff --
How about ```head()``` instead of ```collect()[0]```?
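
The difference can be pictured with a plain-Python analogy (this is not Spark code, and every name in it is hypothetical): `collect()[0]` materializes every row before indexing, whereas `head()` only needs the first one.

```python
def make_rows(values, counter):
    """Yield one row per value, counting how many rows are produced."""
    for v in values:
        counter["produced"] += 1
        yield {"values": v, "buckets": 0.0 if v < 0.4 else 1.0}


def first_via_collect(rows):
    # Analogue of collect()[0]: build the full list, then index into it.
    return list(rows)[0]


def first_via_head(rows):
    # Analogue of head(): stop as soon as the first row is available.
    return next(iter(rows))


data = [0.1, 0.4, 1.2, 1.5]

collect_counter = {"produced": 0}
print(first_via_collect(make_rows(data, collect_counter)), collect_counter)

head_counter = {"produced": 0}
print(first_via_head(make_rows(data, head_counter)), head_counter)
```

Both calls return the same first row, but the collect-style version produces all four rows to get it.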
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-174736475
**[Test build #50037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50037/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161166676
Merged build finished. Test PASSed.
---
[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-161161541
**[Test build #47030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47030/consoleFull)** for PR 10085 at commit [`1145ec4`](https://github.com/apache/spark/commit/1145ec420590fc2e2cfc554433d9ba9ebabeb821).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):`
---
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10085#issuecomment-168051426
re-ping @yanboliang or @jkbradley if you've got the time to look at this (already been reviewed a bit).
---