You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/10/01 05:29:27 UTC

[GitHub] spark pull request #19406: [SPARK-22179] percentile_approx should choose the...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19406

    [SPARK-22179] percentile_approx should choose the first element for small percentage

    ## What changes were proposed in this pull request?
    
    percentile_approx should choose the first element for small percentage
    
    ## How was this patch tested?
    
    Added a new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark minor_improve_percentile_approx

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19406.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19406
    
----
commit 24f8295498a7ad6d2d99ea27a196ccf154165907
Author: Zhenhua Wang <wz...@163.com>
Date:   2017-09-30T16:04:32Z

    return the first element for small percentage

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)** for PR 19406 at commit [`24f8295`](https://github.com/apache/spark/commit/24f8295498a7ad6d2d99ea27a196ccf154165907).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Yes it's ok if consider it's approximate with relativeError. But the point of this pr is that there could be a further improvement in those special cases. IMHO it's more accurate and intuitive for end users, such as return 1 instead of 2 as 10% percentile among 1 to 10.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)** for PR 19406 at commit [`dbc3d47`](https://github.com/apache/spark/commit/dbc3d47b0a56113032d2a4565180932e4ef26219).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82368/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    I think we should probably follow the paper, no? this should fix more cases. Yes, this case also failed for me. The answer 499 is OK too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @srowen @jiangxb1987 OK, I'll close this JIRA and creating a new JIRA as improvement instead of bugfix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)** for PR 19406 at commit [`9815ce8`](https://github.com/apache/spark/commit/9815ce8e17e34422f8c915d115061a9635abd119).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    I read the original paper at http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf and think there might be two bugs in the implementation here that might actually cause the problem.
    
    First, targetError should not be rounded up. The algorithm in 2.2.1 does not show that. 
    
    Second, this starts by considering sampled(1), not sampled(0). But the impl and algorithm in the paper seem to be 0-based.
    
    I would try it myself but for some reason I'm getting null from your query. I'll try again on my end but what do you think of this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @srowen For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10% percentage. Without this fix, it returned 2. I've updated description by adding this example in JIRA and PR, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82419/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @srowen Before the change, the answer in the test case is `2, 2, 2`.
    Based on the code before the change, percentile_approx would never return the first element when percentile is in (relativeError, 1/N], where relativeError default is 1/10000, and N is the total number of elements.
    But ideally, percentiles in [0, 1/N] should all return the first element as the answer.
    
    > why this method has to use relativeError
    
    `QuantileSummaries` is a sampled data structure, `relativeError` is used to compute `targeError`, which decides the error bound of the target rank (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L204).
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Ah, that's fine :). It was just an option. I will follow discussion and help sort it out in any event.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    The JIRA doesn't explain what this is meant to fix. What case does this help get more correct?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    This is actually a bugfix instead of improvement, I think we should follow the approach that @srowen have suggested.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @HyukjinKwon uh...just saw this, already created a new [JIRA](url) and [PR](https://github.com/apache/spark/pull/19438), is it also ok?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    It's probably OK, though kind of special cases the first quantile. This expression doesn't incorporate relativeError. I'm actually not sure why it needs to account for relativeError here; it's an extra level of approximation. But the asymmetry seems a bit wrong.
    
    It is after all approximate? the loop below actually accounts for the same case, but incorporates relativeError.
    
    When in doubt ask the author: @clockfly 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    (What is the answer in your test case before the change?)
    Yea, I guess I am not sure why this method has to use relativeError at all, but I didn't think about it much. If it didn't I think it already does the same thing you've done. Which then leads me to ask why there's an approximation in this part of the process at all.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)** for PR 19406 at commit [`8c8c22d`](https://github.com/apache/spark/commit/8c8c22dbebe99def6127b49988dfc4f886797bd6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Your new test also passes when changing ...
    
    ```
        val targetError = relativeError * count
    ...
        var i = 0
    ```
    
    I think this is more likely to be the correct fix as it matches the paper and isn't just patching the case of the first element.
    
    It does make another test fail, just barely -- seems still within tolerances.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @HyukjinKwon These two JIRAs change percentile_approx in different ways, so maybe it's better to use different JIRAs?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)** for PR 19406 at commit [`dbc3d47`](https://github.com/apache/spark/commit/dbc3d47b0a56113032d2a4565180932e4ef26219).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    cc @WeichenXu123 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)** for PR 19406 at commit [`24f8295`](https://github.com/apache/spark/commit/24f8295498a7ad6d2d99ea27a196ccf154165907).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @HyukjinKwon thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    @srowen I tried your way on my laptop, it works fine. But an existing test case failed: (500 expected , 499 returned)
    ```
      test("percentile_approx, supports constant folding for parameter accuracy and percentages") {
        withTempView(table) {
          (1 to 1000).toDF("col").createOrReplaceTempView(table)
          checkAnswer(
            spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 800D) FROM $table"),
            Row(Seq(500D))
          )
        }
      }
    ```
    It seems to me both way (rounding targetError and starting from 1/ not rounding targetError and starting from 0) can get the answer within relative error. One way is more accurate in some cases, and the other is more accurate in some other cases.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82384/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)** for PR 19406 at commit [`8c8c22d`](https://github.com/apache/spark/commit/8c8c22dbebe99def6127b49988dfc4f886797bd6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19406: [SPARK-22179] percentile_approx should choose the...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy closed the pull request at:

    https://github.com/apache/spark/pull/19406


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Oh, optionally, we can just edit the JIRA I guess.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Your proposed way is more consistent with the paper and I also think it's more natural to start from 0.
    The purpose of this patch is to not introduce accuracy regression for other cases, and improve some corner cases.
    I can't tell which one is better, I think both ways can work.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    cc @thunterdb @cloud-fan @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    **[Test build #82419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)** for PR 19406 at commit [`9815ce8`](https://github.com/apache/spark/commit/9815ce8e17e34422f8c915d115061a9635abd119).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19406
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82379/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org