Posted to reviews@spark.apache.org by holdenk <gi...@git.apache.org> on 2015/12/02 02:22:32 UTC

[GitHub] spark pull request: [SPARK-11937][SPARK-11922][PYSPARK][ML] Python...

GitHub user holdenk opened a pull request:

    https://github.com/apache/spark/pull/10085

    [SPARK-11937][SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer

    Add Python API for ml.feature.QuantileDiscretizer.
    
    One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
    cc @brkyvz & @mengxr 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10085.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10085
    
----
commit dbabade554b93bee191b56115a516f39d4e128ea
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-01T22:03:01Z

    Start working towards implementing the Python interface for QuantileDiscretizer. One question (for review): do we want to change the Bucketizer as I've done, or create a different wrapper? I think this way is better, but it does introduce an extra param, so I'm not sure.

commit 1cacd7667ac0fa37b94fcb842e1d1616898279e9
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-02T01:11:51Z

    Ok remove _java_model before setting the params since it isn't really a param, print out the splits from the trained bucketizer

commit cfb255fc903f8283ef3fc55cf52e0fed8634f9bb
Author: Holden Karau <ho...@us.ibm.com>
Date:   2015-12-02T01:19:06Z

    And make sure the generated model works

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-170710756
  
    **[Test build #49177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49177/consoleFull)** for PR 10085 at commit [`798798c`](https://github.com/apache/spark/commit/798798c49eaa9b6b62c0266d343b2350edac5875).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172938977
  
    Those are the only issues I see.  Thanks everyone for reviewing & @holdenk for the PR!



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172963641
  
    **[Test build #49694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49694/consoleFull)** for PR 10085 at commit [`f21ebef`](https://github.com/apache/spark/commit/f21ebefc1e0edc16c2eed8e5474033e8d3baf1ae).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-170706592
  
    **[Test build #49177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49177/consoleFull)** for PR 10085 at commit [`798798c`](https://github.com/apache/spark/commit/798798c49eaa9b6b62c0266d343b2350edac5875).



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-163392210
  
    Ok - just to make sure, do you see any issues with the current approach for getSplits? It's tested a bit in the doctests, but if there is a potential issue I can add some more tests.



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-173022443
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174732868
  
    **[Test build #50037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50037/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-172238836
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49534/



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10085#discussion_r50175399
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -992,6 +993,88 @@ def getDegree(self):
     
     
     @inherit_doc
    +class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):
    +    """
    +    .. note:: Experimental
    +
    +    `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
    +    categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
    +    into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
    +    covering all real values. This attempts to find numBuckets partitions based on a sample of data,
    +    but it may find fewer depending on the data sample values.
    +
    +    >>> df = sqlContext.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
    +    >>> qds = QuantileDiscretizer(numBuckets=2,
    +    ...     inputCol="values", outputCol="buckets")
    +    >>> bucketizer = qds.fit(df)
    +    >>> splits = bucketizer.getSplits()
    +    >>> splits[0]
    +    -inf
    +    >>> int(splits[1]*10)
    --- End diff --
    
    Sure :)
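
The docstring quoted in the diff above describes the idea behind the estimator: choose split points from a sample so the bins hold roughly equal shares of the data, put -Infinity and +Infinity at the ends, and accept fewer buckets when sample values repeat. A minimal pure-Python sketch of that idea follows. It is not Spark's implementation: `quantile_splits` and `bucket_index` are hypothetical helper names, the rank-based choice of split points is an assumption made for illustration, and the sample is assumed to be non-empty.

```python
import bisect
import math


def quantile_splits(sample, num_buckets):
    """Return sorted split points with -inf and +inf as the outer bounds.

    Hypothetical helper for illustration; assumes a non-empty sample.
    """
    ordered = sorted(sample)
    # Interior candidates: values at evenly spaced ranks of the sorted sample.
    interior = [
        ordered[(i * len(ordered)) // num_buckets]
        for i in range(1, num_buckets)
    ]
    # Repeated values collapse, so fewer than num_buckets buckets may result.
    distinct = sorted(set(interior))
    return [-math.inf] + distinct + [math.inf]


def bucket_index(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1].

    The last bucket is clamped so that +inf still maps to a valid index.
    """
    index = bisect.bisect_right(splits, value) - 1
    return min(index, len(splits) - 2)


if __name__ == "__main__":
    values = [0.1, 0.4, 1.2, 1.5]
    splits = quantile_splits(values, 2)
    print(splits)  # [-inf, 1.2, inf]
    print([bucket_index(v, splits) for v in values])  # [0, 0, 1, 1]
```

The split values produced by this sketch depend on its own rank rule and should not be expected to match the splits that the fitted Bucketizer reports in the doctest.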



[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10085#issuecomment-174677492
  
    **[Test build #50015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull)** for PR 10085 at commit [`463aa37`](https://github.com/apache/spark/commit/463aa377f9af4f5f9d2691abaff0dbb9ff7881b1).

