Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2015/12/02 02:22:32 UTC

[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/10085

    [SPARK-11937][SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer

    Add Python API for ml.feature.QuantileDiscretizer.
    
    One open question: do we want to re-use the Java model as done here, create a new model, or use a different wrapper around the Java model?
    cc @brkyvz & @mengxr 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10085.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10085
    
----
commit dbabade554b93bee191b56115a516f39d4e128ea
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-01T22:03:01Z

    Start working towards implementing a Python interface for QuantileDiscretizer. One question (for review): do we want to change the Bucketizer as I've done, or create a different wrapper? I think this way is better, but it does introduce an extra param, so not sure.

commit 1cacd7667ac0fa37b94fcb842e1d1616898279e9
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-02T01:11:51Z

    Ok remove _java_model before setting the params since it isn't really a param, print out the splits from the trained bucketizer

commit cfb255fc903f8283ef3fc55cf52e0fed8634f9bb
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-02T01:19:06Z

    And make sure the generated model works

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-170710756
  
    **[Test build #49177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49177/consoleFull)** for PR 10085 at commit [`798798c`](https://github.com/apache/spark/commit/798798c49eaa9b6b62c0266d343b2350edac5875).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172938977
  
    Those are the only issues I see.  Thanks everyone for reviewing & @holdenk for the PR!


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172963641
  
    **[Test build #49694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49694/consoleFull)** for PR 10085 at commit [`f21ebef`](https://github.com/apache/spark/commit/f21ebefc1e0edc16c2eed8e5474033e8d3baf1ae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-170706592
  
    **[Test build #49177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49177/consoleFull)** for PR 10085 at commit [`798798c`](https://github.com/apache/spark/commit/798798c49eaa9b6b62c0266d343b2350edac5875).


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-163392210
  
    Ok - just to make sure, do you see any issues with the current approach for getSplits? It's tested a bit in the doctests, but if there is a potential issue I can add some more tests.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173022443
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174732868
  
    **[Test build #50037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50037/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172238836
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49534/
    Test PASSed.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50175399
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    --- End diff --
    
    Sure :)
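
The docstring quoted above describes the idea in prose: take a sample, divide it into roughly equal parts, and use -Infinity and +Infinity as the outer bounds. A minimal plain-Python sketch of that idea follows. It is not Spark's implementation and is not claimed to reproduce Spark's split values; the helper names quantile_splits and bucket_index are hypothetical and exist only for this illustration.

```python
import bisect
import math


def quantile_splits(values, num_buckets):
    """Hypothetical helper: choose split points that cut the sorted values
    into roughly equal parts, bounded by -inf and +inf."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    interior = []
    for i in range(1, num_buckets):
        # Candidate split at the i/num_buckets position of the sorted data.
        candidate = ordered[min(n - 1, (i * n) // num_buckets)]
        # Skip repeated candidates, so fewer buckets than requested may result.
        if not interior or candidate > interior[-1]:
            interior.append(candidate)
    return [-math.inf] + interior + [math.inf]


def bucket_index(value, splits):
    """Hypothetical helper: bucket i covers the half-open range
    [splits[i], splits[i + 1])."""
    index = bisect.bisect_right(splits, value) - 1
    # Clamp so that +inf itself lands in the last bucket.
    return min(index, len(splits) - 2)


splits = quantile_splits([0.1, 0.4, 1.2, 1.5], 2)
buckets = [bucket_index(v, splits) for v in [0.1, 0.4, 1.2, 1.5]]
```

With heavily repeated input such as [1.0, 1.0, 1.0, 1.0] and num_buckets=3, this sketch returns only one interior split, which mirrors the docstring's remark that fewer buckets than requested may be found.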


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174677492
  
    **[Test build #50015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).


[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161166678
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47031/
    Test PASSed.


[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161163116
  
    **[Test build #47031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47031/consoleFull)** for PR 10085 at commit [`2afd197`](https://github.com/apache/spark/commit/2afd197cf74ca9552333ddd7a13bbfe8bd35490c).


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172238655
  
    **[Test build #49534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49534/consoleFull)** for PR 10085 at commit [`5e18778`](https://github.com/apache/spark/commit/5e18778d04266b0fd63ec70b871404bde83b0c58).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50194199
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    --- End diff --
    
    Tried that; it seems to get printed differently, so I'll go back to the int one instead (or just drop it).
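
The comment does not say what exactly printed differently. Assuming the concern is that a float's printed form is fragile as expected doctest output, the sketch below contrasts two ways of getting a stable printed result: the int(value * 10) trick used in the quoted doctest, and a tolerance-based comparison. The helper matches_expected and the stand-in value are hypothetical and are not taken from the PR.

```python
import math


def matches_expected(actual, expected, tol=1e-9):
    """Hypothetical helper: compare floats within a tolerance, so a doctest
    can show a stable True/False instead of the float's full repr."""
    return math.isclose(actual, expected, rel_tol=tol, abs_tol=tol)


# Stand-in for a computed split value; its repr is 0.30000000000000004,
# which is awkward to pin down as expected doctest output.
split = 0.1 + 0.2

print(matches_expected(split, 0.3))  # prints True
print(int(split * 10))               # prints 3
```

Both forms only pin the value down to a chosen precision, so they trade exactness for stable output.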


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173644319
  
    Think I addressed all of @jkbradley's comments


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172238834
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-163446677
  
    @yinxusen thanks :)


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-163442465
  
    @holdenk No more issues in getSplits. It looks good.


[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161152153
  
    **[Test build #47026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47026/consoleFull)** for PR 10085 at commit [`2540101`](https://github.com/apache/spark/commit/254010184eeee33a4a9f8aeda7b77bfd365b18f3).


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174736775
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-171904059
  
    @jkbradley LGTM except for the version labels.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174857126
  
    LGTM
    Merging with master
    Thanks for the PR!


[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161304048
  
    cc @yanboliang who filed the JIRA for this.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-171473259
  
    re-ping @jkbradley ?


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174728577
  
    **[Test build #50015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-163165252
  
    I vote for making ```Bucketizer``` a ```Model``` rather than a ```Transformer```, which is consistent with the Scala code. @yinxusen Could you explain why you gave up on this approach?
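
For readers unfamiliar with the distinction being debated, the sketch below shows the general estimator/model pattern in plain Python: an estimator's fit step returns a model, and a model is a transformer that remembers what it learned. All class names here (Transformer, Model, Estimator, ToyBucketizer, ToyDiscretizer) are hypothetical stand-ins written for this illustration. They are not the pyspark.ml classes, and the median-based fit is only a placeholder for the real split computation.

```python
import bisect
import math


class Transformer:
    """Hypothetical base class: maps input values to output values."""

    def transform(self, values):
        raise NotImplementedError


class Model(Transformer):
    """Hypothetical base class: a Transformer produced by fitting an Estimator."""

    def __init__(self, parent=None):
        self.parent = parent  # the estimator that produced this model, if any


class Estimator:
    """Hypothetical base class: learns from data and returns a Model."""

    def fit(self, values):
        raise NotImplementedError


class ToyBucketizer(Model):
    """Assigns each value the index of the bucket it falls into."""

    def __init__(self, splits, parent=None):
        super().__init__(parent)
        self.splits = list(splits)

    def get_splits(self):
        return self.splits

    def transform(self, values):
        return [bisect.bisect_right(self.splits, v) - 1 for v in values]


class ToyDiscretizer(Estimator):
    """Learns a single split (placeholder logic) and returns a ToyBucketizer."""

    def fit(self, values):
        ordered = sorted(values)
        middle = ordered[len(ordered) // 2]
        return ToyBucketizer([-math.inf, middle, math.inf], parent=self)


estimator = ToyDiscretizer()
model = estimator.fit([0.1, 0.4, 1.2, 1.5])
result = model.transform([0.1, 1.5])
```

In this sketch the same bucketizer class can be built directly from explicit splits or returned from a fit call, which is roughly the dual role under discussion in the thread.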


[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-162658080
  
    Can you please only link to the specific JIRA, not the umbrella?


[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161155114
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173030256
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49726/
    Test PASSed.


[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50160074
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol):
                   "specified will be treated as errors.")
     
         @keyword_only
    -    def __init__(self, splits=None, inputCol=None, outputCol=None):
    +    def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None):
    --- End diff --
    
    Oh yeah, I think the original plan was to avoid the overhead of creating a new object and sending the params back to the JVM when `_java_model` is supplied, since we already had a transformer. I'll remove this.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50160361
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    --- End diff --
    
    It's a float, so I truncated the split to keep the doctest from being flaky while still being human-readable.
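A minimal sketch of the truncation idea in the comment above, in plain Python with no Spark dependency. The helper name `stable_split_repr` and the `scale` parameter are hypothetical and not part of the PR; the sketch only shows why `int(splits[1]*10)` gives a stable doctest value when the float carries small noise.

```python
import math


def stable_split_repr(split, scale=10):
    """Hypothetical helper: truncate a float split so doctest output is stable.

    Infinite bounds are returned unchanged because int() cannot convert them.
    """
    if math.isinf(split):
        return split
    # int() truncates toward zero, dropping the noisy low-order digits.
    return int(split * scale)


# A split of 0.4 and a slightly noisy 0.4000000001 both print as 4.
print(stable_split_repr(0.4))
print(stable_split_repr(0.4000000001))
```

One caveat with this approach: truncation is only stable when the noise is on the high side. A value just below 0.4 would truncate to 3, so `round(split, 1)` may be the safer choice if the error can go either way.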




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50150751
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol):
                   "specified will be treated as errors.")
     
         @keyword_only
    -    def __init__(self, splits=None, inputCol=None, outputCol=None):
    +    def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None):
    --- End diff --
    
    Why is _java_model needed?  It does not seem to be used.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174728776
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-164526253
  
    cc @yanboliang if you have a chance to take a look




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-162748542
  
    Hi @holdenk, I think this PR duplicates mine: https://github.com/apache/spark/pull/10007




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10085




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r49830062
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -972,6 +973,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    +    4
    +    >>> bucketed = bucketizer.transform(df).collect()
    +    >>> bucketed[0].buckets
    +    0.0
    +
    +    .. versionadded:: 1.6.0
    --- End diff --
    
    Change it to 2.0.0.
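The docstring quoted in the diff above says the splits are chosen by dividing a sample into roughly equal parts, bounded by -Infinity and +Infinity, and that fewer than numBuckets buckets may result. The following is a rough pure-Python sketch of that description, not Spark's actual implementation; the function `quantile_splits` and its index rule are assumptions made for illustration.

```python
def quantile_splits(sample, num_buckets):
    """Hypothetical sketch: pick split points that divide `sample` into
    roughly equal parts. This is NOT the algorithm Spark uses."""
    if num_buckets < 2:
        raise ValueError("num_buckets must be >= 2")
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        return [float("-inf"), float("inf")]
    candidates = []
    for i in range(1, num_buckets):
        # Integer ceiling of i * n / num_buckets, minus one for 0-based indexing.
        idx = (i * n + num_buckets - 1) // num_buckets - 1
        candidates.append(ordered[idx])
    # Duplicate candidates collapse, so fewer buckets may come out.
    inner = sorted(set(candidates))
    return [float("-inf")] + inner + [float("inf")]


# Same values as the doctest in the diff.
print(quantile_splits([0.1, 0.4, 1.2, 1.5], 2))
# Repeated values collapse to a single inner split.
print(quantile_splits([1.0, 1.0, 1.0, 1.0, 1.0], 3))
```

With the doctest's four values and two buckets, this sketch places the single inner split at 0.4. When the sample contains many repeated values the candidate splits coincide, which is one way the "may find fewer" case in the docstring can arise.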




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174728777
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50015/
    Test FAILed.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161166603
  
    **[Test build #47031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47031/consoleFull)** for PR 10085 at commit [`2afd197`](https://github.com/apache/spark/commit/2afd197cf74ca9552333ddd7a13bbfe8bd35490c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):`




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-163172333
  
    @yanboliang I am OK with changing `Bucketizer` to a `JavaModel`. At the time I just did not want to change that piece of code. That is also why I closed my PR: I think @holdenk's implementation is better. But like I said, be careful with `getSplits`. :)




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173022445
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49719/
    Test FAILed.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161155116
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47026/
    Test FAILed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-170710926
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49177/
    Test PASSed.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161161568
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47030/
    Test FAILed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-170710921
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-168056674
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173022402
  
    **[Test build #49719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49719/consoleFull)** for PR 10085 at commit [`f9e3086`](https://github.com/apache/spark/commit/f9e3086b2fa7eae24f22aa2fd32eb644d829e52b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174730604
  
    seems unrelated, jenkins retest this please.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-168056577
  
    **[Test build #48495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48495/consoleFull)** for PR 10085 at commit [`601a9ea`](https://github.com/apache/spark/commit/601a9eaa35d164cd94c48dd8d94a64931f10c6ea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172963904
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49694/
    Test PASSed.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-162749686
  
    OK, I did not realize that there is an umbrella JIRA for this. I'll close mine.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172963900
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173019235
  
    **[Test build #49719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49719/consoleFull)** for PR 10085 at commit [`f9e3086`](https://github.com/apache/spark/commit/f9e3086b2fa7eae24f22aa2fd32eb644d829e52b).




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173030254
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50150766
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    +    4
    +    >>> bucketed = bucketizer.transform(df).collect()
    +    >>> bucketed[0].buckets
    +    0.0
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    numBuckets = Param(Params._dummy(), "numBuckets",
    +                       "Maximum number of buckets (quantiles, or " +
    +                       "categories) into which data points are grouped. Must be >= 2. Default 2.")
    +
    +    @keyword_only
    +    def __init__(self, numBuckets=2, inputCol=None, outputCol=None):
    +        """
    +        __init__(self, numBuckets=2, inputCol=None, outputCol=None)
    +        """
    +        super(QuantileDiscretizer, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.QuantileDiscretizer",
    +                                            self.uid)
    +        self.numBuckets = Param(self, "numBuckets",
    +                                "Maximum number of buckets (quantiles, or " +
    +                                "categories) into which data points are grouped. Must be >= 2.")
    +        self._setDefault(numBuckets=2)
    +        kwargs = self.__init__._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.0.0")
    +    def setParams(self, numBuckets=2, inputCol=None, outputCol=None):
    +        """
    +        setParams(self, numBuckets=2, inputCol=None, outputCol=None)
    +        Set the params for the QuantileDiscertizerBase
    --- End diff --
    
    typo




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-168056675
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48495/
    Test PASSed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-171492302
  
    I'll try to check this soon, but I have some others to get to first.  It would be great if someone else could review this PR in the meantime.  @yinxusen Would you have time?  Thanks!




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161155079
  
    **[Test build #47026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47026/consoleFull)** for PR 10085 at commit [`2540101`](https://github.com/apache/spark/commit/254010184eeee33a4a9f8aeda7b77bfd365b18f3).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):`




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r49829992
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -972,6 +973,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    +    4
    +    >>> bucketed = bucketizer.transform(df).collect()
    +    >>> bucketed[0].buckets
    +    0.0
    +
    +    .. versionadded:: 1.6.0
    +    """
    +
    +    # a placeholder to make it appear in the generated doc
    +    numBuckets = Param(Params._dummy(), "numBuckets",
    +                       "Maximum number of buckets (quantiles, or " +
    +                       "categories) into which data points are grouped. Must be >= 2.")
    --- End diff --
    
    Should we add a `default 2` here?
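
For readers following the docstring quoted in this diff, here is a minimal pure-Python sketch of the behaviour it describes: split points taken from a sorted sample, outer bounds of -inf and +inf, and fewer buckets than requested when sample values repeat. It is not Spark's implementation. `quantile_splits` and `bucketize` are hypothetical helpers, the index rule is an assumption chosen to reproduce the doctest's split of 0.4, and the `num_buckets=2` default only mirrors the question above.

```python
import math
from bisect import bisect_right


def quantile_splits(sample, num_buckets=2):
    """Return split points [-inf, ..., +inf] from a sample (illustrative only)."""
    if num_buckets < 2:
        raise ValueError("num_buckets must be >= 2")
    ordered = sorted(sample)
    if not ordered:
        raise ValueError("sample must not be empty")
    n = len(ordered)
    inner = []
    for i in range(1, num_buckets):
        # Integer ceiling of i * n / num_buckets, shifted to a 0-based index.
        idx = (i * n + num_buckets - 1) // num_buckets - 1
        candidate = ordered[idx]
        # Keep splits strictly increasing; repeated values yield fewer buckets.
        if not inner or candidate > inner[-1]:
            inner.append(candidate)
    return [-math.inf] + inner + [math.inf]


def bucketize(value, splits):
    """Map a value to the index of the half-open bucket [splits[i], splits[i+1])."""
    index = bisect_right(splits, value) - 1
    # The last bucket also accepts its upper bound (+inf here).
    return min(index, len(splits) - 2)


# Same four values as the doctest in the diff.
values = [0.1, 0.4, 1.2, 1.5]
splits = quantile_splits(values, num_buckets=2)
buckets = [bucketize(v, splits) for v in values]
```

With the doctest's four values this gives the splits `[-inf, 0.4, inf]`, and the first row lands in bucket 0, matching `bucketed[0].buckets` in the docstring.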




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161161566
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172237877
  
    **[Test build #49534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49534/consoleFull)** for PR 10085 at commit [`5e18778`](https://github.com/apache/spark/commit/5e18778d04266b0fd63ec70b871404bde83b0c58).




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173030147
  
    **[Test build #49726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49726/consoleFull)** for PR 10085 at commit [`194ec6d`](https://github.com/apache/spark/commit/194ec6daa6da519490c33af4aa431f91cc7df88d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172235563
  
    @yinxusen / @jkbradley: updated the `versionadded` tag to 2.0.0 :)




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50175081
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    --- End diff --
    
    How about using round instead?
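
A small sketch of why rounding is the more robust check here. The value below is hypothetical (it is not taken from the PR's test run); it stands in for a split that comes back a hair under 0.4 because of floating-point noise.

```python
# Hypothetical split value just below 0.4; not an actual value from the PR.
slightly_low_split = 0.4 - 1e-12

# Truncation, as in the doctest under review: int() drops the fraction,
# so a value just under 0.4 yields 3 instead of the expected 4.
truncated = int(slightly_low_split * 10)

# Rounding to one decimal place absorbs the noise.
rounded = round(slightly_low_split, 1)

# For an exact 0.4 both approaches agree.
exact_truncated = int(0.4 * 10)
exact_rounded = round(0.4, 1)
```

Here `truncated` is 3 while `rounded` is 0.4, so a doctest written with `round(splits[1], 1)` would keep passing where the `int(splits[1]*10)` form would not.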




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172429593
  
    LGTM as well! Thanks.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-168054538
  
    **[Test build #48495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48495/consoleFull)** for PR 10085 at commit [`601a9ea`](https://github.com/apache/spark/commit/601a9eaa35d164cd94c48dd8d94a64931f10c6ea).




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173027414
  
    **[Test build #49726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49726/consoleFull)** for PR 10085 at commit [`194ec6d`](https://github.com/apache/spark/commit/194ec6daa6da519490c33af4aa431f91cc7df88d).




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161159457
  
    **[Test build #47030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47030/consoleFull)** for PR 10085 at commit [`1145ec4`](https://github.com/apache/spark/commit/1145ec420590fc2e2cfc554433d9ba9ebabeb821).




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174736779
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50037/
    Test PASSed.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-162761248
  
    @holdenk For your questions, I first tried to modify the interface of `Bucketizer`, making it a `JavaModel` rather than a `JavaTransformer`. But I finally decided not to touch the `Bucketizer`, and added an inner class, [`QuantileDiscretizerModel`](https://github.com/yinxusen/spark/blob/3a33327122ae94d59403d807255273180528d9a9/python/pyspark/ml/feature.py#L2149), to get the splits.
    
    But I recommend testing the `getSplits` of the `Bucketizer` generated by the `QuantileDiscretizer`, since I got a serialization error and added a [`getJavaSplits`](https://github.com/yinxusen/spark/blob/3a33327122ae94d59403d807255273180528d9a9/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L64) to avoid it. JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-7379).




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50150754
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    --- End diff --
    
    This is odd.  Can you not just check ```splits[1]```?




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172959696
  
    **[Test build #49694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49694/consoleFull)** for PR 10085 at commit [`f21ebef`](https://github.com/apache/spark/commit/f21ebefc1e0edc16c2eed8e5474033e8d3baf1ae).




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-171518197
  
    @jkbradley I'll help you review this.




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50150757
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    +    4
    +    >>> bucketed = bucketizer.transform(df).collect()
    --- End diff --
    
    How about ```head()``` instead of ```collect()[0]```?
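
The difference behind this suggestion can be sketched without Spark. The code below is only a pure-Python analogy, not PySpark: `CountingSource`, `collect_first`, and `head_like` are hypothetical names, and the counter stands in for rows brought back to the driver. It assumes, as in PySpark, that `head()` with no argument returns a single row or `None`.

```python
from itertools import islice


class CountingSource:
    """Iterable that counts how many rows a consumer actually pulled."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.produced = 0

    def __iter__(self):
        for row in self.rows:
            self.produced += 1
            yield row


def collect_first(source):
    # Analogue of collect()[0]: materialize everything, then index.
    return list(source)[0]


def head_like(source):
    # Analogue of head(): pull at most one row; None if there are no rows.
    first = list(islice(source, 1))
    return first[0] if first else None


all_rows = CountingSource([0.1, 0.4, 1.2, 1.5])
one_row = CountingSource([0.1, 0.4, 1.2, 1.5])

first_via_collect = collect_first(all_rows)
first_via_head = head_like(one_row)
```

Both calls return the same first row, but `all_rows.produced` ends at 4 while `one_row.produced` ends at 1, which is the saving the reviewer is pointing at.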




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174736475
  
    **[Test build #50037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50037/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161166676
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-161161541
  
    **[Test build #47030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47030/consoleFull)** for PR 10085 at commit [`1145ec4`](https://github.com/apache/spark/commit/1145ec420590fc2e2cfc554433d9ba9ebabeb821).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):`




[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-168051426
  
    re-ping @yanboliang or @jkbradley if you've got the time to look at this (already been reviewed a bit).

