
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2016/02/25 01:56:15 UTC

[GitHub] spark pull request: Added Python API for approxQuantile

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/11356

    Added Python API for approxQuantile

    ## What changes were proposed in this pull request?
    
    * Scala DataFrameStatFunctions: Added version of approxQuantile taking a List instead of an Array, for Python compatibility
    * Python DataFrame and DataFrameStatFunctions: Added approxQuantile
    
    ## How was this patch tested?
    
    * unit test in sql/tests.py
    
    Documentation was copied from the existing approxQuantile exactly.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark approx-quantile-python

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11356.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11356
    
----
commit e2f85b623a33982e3b9737a86442a04ee6c7f9eb
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-02-25T00:52:16Z

    Added Python API for approxQuantile

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188613807
  
    LGTM



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188633381
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188631576
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188578103
  
    **[Test build #51916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51916/consoleFull)** for PR 11356 at commit [`770cb4c`](https://github.com/apache/spark/commit/770cb4cc7a655d834f5e326056960bcb1ac22be4).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188540618
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188550975
  
    **[Test build #51916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51916/consoleFull)** for PR 11356 at commit [`770cb4c`](https://github.com/apache/spark/commit/770cb4cc7a655d834f5e326056960bcb1ac22be4).



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188583500
  
    Weird...the method works for me locally.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188647418
  
    Merged into master. Thanks!



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11356#discussion_r54047396
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -71,6 +71,41 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       }
     
       /**
    +   * Calculates the approximate quantiles of a numerical column of a DataFrame.
    +   * Provided for the Python API.
    +   *
    +   * The result of this algorithm has the following deterministic bound:
    +   * If the DataFrame has N elements and if we request the quantile at probability `p` up to error
    +   * `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
    +   * of `x` is close to (p * N).
    +   * More precisely,
    +   *
    +   *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    +   *
    +   * This method implements a variation of the Greenwald-Khanna algorithm (with some speed
    +   * optimizations).
    +   * The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    +   * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    +   *
    +   * @param col the name of the numerical column
    +   * @param probabilities a list of quantile probabilities
    +   *   Each number must belong to [0, 1].
    +   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
    +   * @param relativeError The relative target precision to achieve (>= 0).
    +   *   If set to zero, the exact quantiles are computed, which could be very expensive.
    +   *   Note that values greater than 1 are accepted but give the same result as 1.
    +   * @return the approximate quantiles at the given probabilities
    +   *
    +   * @since 2.0.0
    +   */
    +  private[spark] def approxQuantile(
    +      col: String,
    +      probabilities: List[Double],
    +      relativeError: Double): Array[Double] = {
    --- End diff --
    
    We can return `java.util.List[Double]`, which would simplify the Python implementation.
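
The deterministic bound quoted in the doc comment above can be checked by hand on small data. The following is a minimal pure-Python sketch, not Spark code: `rank_bounds` and `satisfies_bound` are hypothetical helper names, rank(x) is assumed to mean the number of elements less than or equal to `x`, and clamping the bounds to [0, N] is an assumption of this sketch.

```python
import bisect
import math


def rank_bounds(p, err, n):
    """Return (lo, hi) such that lo <= rank(x) <= hi under the documented bound."""
    lo = math.floor((p - err) * n)
    hi = math.ceil((p + err) * n)
    # Clamp to the valid rank range [0, n] (an assumption of this sketch).
    return max(lo, 0), min(hi, n)


def satisfies_bound(sorted_values, x, p, err):
    """Check whether x is an acceptable answer for quantile p with error err.

    rank(x) is taken as the count of elements <= x (an assumption).
    """
    n = len(sorted_values)
    rank = bisect.bisect_right(sorted_values, x)
    lo, hi = rank_bounds(p, err, n)
    return lo <= rank <= hi


data = list(range(1, 1001))  # N = 1000, already sorted
print(rank_bounds(0.5, 0.125, len(data)))
print(satisfies_bound(data, 500, 0.5, 0.125))
print(satisfies_bound(data, 700, 0.5, 0.125))
```

With N = 1000, p = 0.5 and err = 0.125, any returned element whose rank lies between 375 and 625 meets the guarantee, so 500 is acceptable and 700 is not.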



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188539791
  
    CC: @mengxr @thunterdb 



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188540612
  
    **[Test build #51913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51913/consoleFull)** for PR 11356 at commit [`e2f85b6`](https://github.com/apache/spark/commit/e2f85b623a33982e3b9737a86442a04ee6c7f9eb).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188578416
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51916/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188539832
  
    **[Test build #51913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51913/consoleFull)** for PR 11356 at commit [`e2f85b6`](https://github.com/apache/spark/commit/e2f85b623a33982e3b9737a86442a04ee6c7f9eb).



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11356#discussion_r54047394
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -71,6 +71,41 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       }
     
       /**
    +   * Calculates the approximate quantiles of a numerical column of a DataFrame.
    +   * Provided for the Python API.
    +   *
    +   * The result of this algorithm has the following deterministic bound:
    +   * If the DataFrame has N elements and if we request the quantile at probability `p` up to error
    +   * `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
    +   * of `x` is close to (p * N).
    +   * More precisely,
    +   *
    +   *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    +   *
    +   * This method implements a variation of the Greenwald-Khanna algorithm (with some speed
    +   * optimizations).
    +   * The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    +   * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    +   *
    +   * @param col the name of the numerical column
    +   * @param probabilities a list of quantile probabilities
    +   *   Each number must belong to [0, 1].
    +   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
    +   * @param relativeError The relative target precision to achieve (>= 0).
    +   *   If set to zero, the exact quantiles are computed, which could be very expensive.
    +   *   Note that values greater than 1 are accepted but give the same result as 1.
    +   * @return the approximate quantiles at the given probabilities
    +   *
    +   * @since 2.0.0
    --- End diff --
    
    This is a package-private API. Maybe we should simply say "Python-friendly version of [[...]]".
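
The parameter constraints listed in the doc comment above (each probability in [0, 1], `relativeError` >= 0) could be enforced on the Python side before the call is forwarded to the JVM. This is a minimal sketch under stated assumptions: `validate_quantile_args` is a hypothetical helper, not the function added by this pull request, and the choice of exception types is illustrative.

```python
def validate_quantile_args(probabilities, relative_error):
    """Hypothetical helper: check arguments against the documented constraints.

    Returns the probabilities as a list of floats and the error as a float.
    """
    if not isinstance(probabilities, (list, tuple)):
        raise TypeError("probabilities should be a list or tuple")
    probs = [float(p) for p in probabilities]
    for p in probs:
        # Each probability must belong to [0, 1]: 0 is the minimum, 1 the maximum.
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities should be numbers in [0, 1]")
    err = float(relative_error)
    # The relative error must be >= 0; values above 1 are accepted as documented.
    if err < 0.0:
        raise ValueError("relativeError should be >= 0")
    return probs, err


print(validate_quantile_args([0, 0.5, 1], 0.01))
```

Converting to plain floats up front means integer inputs such as `0` and `1` are normalised before they cross the Python-to-JVM boundary.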



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188631382
  
    **[Test build #51932 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51932/consoleFull)** for PR 11356 at commit [`3f18d78`](https://github.com/apache/spark/commit/3f18d78502ce89ab961c75afd857ffd8cfd0d5f1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11356



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11356#discussion_r54047942
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -71,6 +71,41 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       }
     
       /**
    +   * Calculates the approximate quantiles of a numerical column of a DataFrame.
    +   * Provided for the Python API.
    +   *
    +   * The result of this algorithm has the following deterministic bound:
    +   * If the DataFrame has N elements and if we request the quantile at probability `p` up to error
    +   * `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
    +   * of `x` is close to (p * N).
    +   * More precisely,
    +   *
    +   *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    +   *
    +   * This method implements a variation of the Greenwald-Khanna algorithm (with some speed
    +   * optimizations).
    +   * The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    +   * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    +   *
    +   * @param col the name of the numerical column
    +   * @param probabilities a list of quantile probabilities
    +   *   Each number must belong to [0, 1].
    +   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
    +   * @param relativeError The relative target precision to achieve (>= 0).
    +   *   If set to zero, the exact quantiles are computed, which could be very expensive.
    +   *   Note that values greater than 1 are accepted but give the same result as 1.
    +   * @return the approximate quantiles at the given probabilities
    +   *
    +   * @since 2.0.0
    +   */
    +  private[spark] def approxQuantile(
    +      col: String,
    +      probabilities: List[Double],
    +      relativeError: Double): Array[Double] = {
    --- End diff --
    
    Yeah, again I was trying to follow other code there...but probably shouldn't.  Fixed now.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188611115
  
    **[Test build #51933 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51933/consoleFull)** for PR 11356 at commit [`4f21f06`](https://github.com/apache/spark/commit/4f21f06a21e5ffa415ed4d63d9d1cada078d108a).



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11356#discussion_r54036751
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -71,6 +71,41 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       }
     
       /**
    +   * Calculates the approximate quantiles of a numerical column of a DataFrame.
    +   * Provided for the Python API.
    +   *
    +   * The result of this algorithm has the following deterministic bound:
    +   * If the DataFrame has N elements and if we request the quantile at probability `p` up to error
    +   * `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
    +   * of `x` is close to (p * N).
    +   * More precisely,
    +   *
    +   *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    +   *
    +   * This method implements a variation of the Greenwald-Khanna algorithm (with some speed
    +   * optimizations).
    +   * The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    +   * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    +   *
    +   * @param col the name of the numerical column
    +   * @param probabilities a list of quantile probabilities
    +   *   Each number must belong to [0, 1].
    +   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
    +   * @param relativeError The relative target precision to achieve (>= 0).
    +   *   If set to zero, the exact quantiles are computed, which could be very expensive.
    +   *   Note that values greater than 1 are accepted but give the same result as 1.
    +   * @return the approximate quantiles at the given probabilities
    +   *
    +   * @since 2.0.0
    +   */
    +  def approxQuantile(
    +      col: String,
    +      probabilities: List[Double],
    --- End diff --
    
    Done, thanks



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188608553
  
    **[Test build #51932 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51932/consoleFull)** for PR 11356 at commit [`3f18d78`](https://github.com/apache/spark/commit/3f18d78502ce89ab961c75afd857ffd8cfd0d5f1).



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188633059
  
    **[Test build #51933 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51933/consoleFull)** for PR 11356 at commit [`4f21f06`](https://github.com/apache/spark/commit/4f21f06a21e5ffa415ed4d63d9d1cada078d108a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188540621
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51913/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11356#discussion_r54047585
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -71,6 +71,41 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       }
     
       /**
    +   * Calculates the approximate quantiles of a numerical column of a DataFrame.
    +   * Provided for the Python API.
    +   *
    +   * The result of this algorithm has the following deterministic bound:
    +   * If the DataFrame has N elements and if we request the quantile at probability `p` up to error
    +   * `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
    +   * of `x` is close to (p * N).
    +   * More precisely,
    +   *
    +   *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    +   *
    +   * This method implements a variation of the Greenwald-Khanna algorithm (with some speed
    +   * optimizations).
    +   * The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    +   * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    +   *
    +   * @param col the name of the numerical column
    +   * @param probabilities a list of quantile probabilities
    +   *   Each number must belong to [0, 1].
    +   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
    +   * @param relativeError The relative target precision to achieve (>= 0).
    +   *   If set to zero, the exact quantiles are computed, which could be very expensive.
    +   *   Note that values greater than 1 are accepted but give the same result as 1.
    +   * @return the approximate quantiles at the given probabilities
    +   *
    +   * @since 2.0.0
    --- End diff --
    
    I would have, but I was just following other conventions in the DataFrame code.  I'll change it though.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188607697
  
    Oh, never mind; I didn't test the last commit properly. I'll send a fix.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188631579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51932/
    Test PASSed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188578413
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11356#discussion_r54035313
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -71,6 +71,41 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       }
     
       /**
    +   * Calculates the approximate quantiles of a numerical column of a DataFrame.
    +   * Provided for the Python API.
    +   *
    +   * The result of this algorithm has the following deterministic bound:
    +   * If the DataFrame has N elements and if we request the quantile at probability `p` up to error
    +   * `err`, then the algorithm will return a sample `x` from the DataFrame so that the *exact* rank
    +   * of `x` is close to (p * N).
    +   * More precisely,
    +   *
    +   *   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    +   *
    +   * This method implements a variation of the Greenwald-Khanna algorithm (with some speed
    +   * optimizations).
    +   * The algorithm was first presented in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    +   * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    +   *
    +   * @param col the name of the numerical column
    +   * @param probabilities a list of quantile probabilities
    +   *   Each number must belong to [0, 1].
    +   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
    +   * @param relativeError The relative target precision to achieve (>= 0).
    +   *   If set to zero, the exact quantiles are computed, which could be very expensive.
    +   *   Note that values greater than 1 are accepted but give the same result as 1.
    +   * @return the approximate quantiles at the given probabilities
    +   *
    +   * @since 2.0.0
    +   */
    +  def approxQuantile(
    +      col: String,
    +      probabilities: List[Double],
    --- End diff --
    
    Maybe use java.util.List to make it explicit that this is a Java list, not a Scala list.
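
    For context on the Python side: as far as I understand, a Python list passed through Py4J reaches the JVM as a java.util.List, which is why the overload's parameter type matters. The sketch below shows the kind of argument checking a Python wrapper could do before making that call, based only on the constraints stated in the @param docs above. The function name check_approx_quantile_args is hypothetical and this is not the code from the patch.

```python
def check_approx_quantile_args(col, probabilities, relative_error):
    """Validate arguments against the documented parameter constraints.

    Hypothetical helper for illustration only; not part of the patch.
    Returns the normalized (col, probabilities, relative_error) triple.
    """
    if not isinstance(col, str):
        raise ValueError("col should be a string")

    # Accept tuples as well as lists, then normalize to a plain list.
    if isinstance(probabilities, tuple):
        probabilities = list(probabilities)
    if not isinstance(probabilities, list):
        raise ValueError("probabilities should be a list or tuple")

    for p in probabilities:
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not is_number or not 0 <= p <= 1:
            raise ValueError("each probability must be a number in [0, 1]")

    is_number = (isinstance(relative_error, (int, float))
                 and not isinstance(relative_error, bool))
    if not is_number or relative_error < 0:
        raise ValueError("relative_error should be a number >= 0")

    # Convert to floats so the JVM side would see doubles.
    return col, [float(p) for p in probabilities], float(relative_error)
```

    A caller would run this first and then hand the normalized values to the JVM method, so that bad input fails in Python with a readable message rather than as a Py4J error.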




[GitHub] spark pull request: [SPARK-13479] [SQL] [PYTHON] Added Python API ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11356#issuecomment-188633385
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51933/
    Test PASSed.

