Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/09/17 10:20:00 UTC
[jira] [Assigned] (SPARK-32908) percentile_approx() returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-32908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-32908:
------------------------------------
Assignee: (was: Apache Spark)
> percentile_approx() returns incorrect results
> ---------------------------------------------
>
> Key: SPARK-32908
> URL: https://issues.apache.org/jira/browse/SPARK-32908
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.8, 3.0.2, 3.1.0
> Reporter: Maxim Gekk
> Priority: Major
> Attachments: percentile_approx-input.csv
>
>
> Read input data from the attached CSV file:
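> {code:scala}
> // The report does not define the view name; this value is a placeholder.
> val table = "percentile_approx_input"
> val df = spark.read
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .csv("/Users/maximgekk/tmp/percentile_approx-input.csv")
>   .repartition(1)
> df.createOrReplaceTempView(table)
> {code}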
> Calculate the 0.77 percentile with the accuracy argument 100000 (relative error 1e-05):
> {code:scala}
> spark.sql(
>   s"""SELECT
>      |  percentile_approx(tr_rat_resampling_score, 0.77, 100000)
>      |FROM $table
>    """.stripMargin).show
> {code}
> {code}
> +------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 100000)|
> +------------------------------------------------------------------------+
> |                                                                    1000|
> +------------------------------------------------------------------------+
> {code}
> The same query with a lower accuracy argument of 1000 (relative error 0.001):
> {code}
> +----------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000)|
> +----------------------------------------------------------------------+
> |                                                                    18|
> +----------------------------------------------------------------------+
> {code}
> and with a higher accuracy argument of 1000000 (relative error 1e-06):
> {code}
> +-------------------------------------------------------------------------+
> |percentile_approx(tr_rat_resampling_score, CAST(0.77 AS DOUBLE), 1000000)|
> +-------------------------------------------------------------------------+
> |                                                                       17|
> +-------------------------------------------------------------------------+
> {code}
> With relative error 1e-05, the result should be around 17-18, not 1000.
> Here is the percentile calculation in Google Sheets for the same input:
> https://docs.google.com/spreadsheets/d/1Y1i4Td6s9jZQ-bD4IRTESLXP3UxKpqJSXGtmx0Q5TA0/edit?usp=sharing
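
A reference value can also be checked without Spark. The sketch below is not part of the original report: the object name PercentileCheck and its helpers are hypothetical, and it assumes a nearest-rank definition of the percentile (the smallest value whose rank is at least ceil(p * n)). Google Sheets interpolates between neighbouring values, so the two can differ slightly. It also shows how the third argument of percentile_approx maps to the relative error quoted above, since Spark documents 1.0/accuracy as the relative error of the approximation.

```scala
// Hypothetical helper for cross-checking percentile_approx results;
// not taken from the Spark code base or from the original report.
object PercentileCheck {

  // Spark documents 1.0 / accuracy as the relative error of percentile_approx,
  // so accuracy = 100000 corresponds to a relative error of 1e-05.
  def relativeError(accuracy: Long): Double = {
    require(accuracy > 0, "accuracy must be positive")
    1.0 / accuracy
  }

  // Exact nearest-rank percentile (an assumed definition): sort the values and
  // return the one at 1-based rank ceil(p * n), clamped to at least rank 1.
  def exactPercentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.toArray
    java.util.Arrays.sort(sorted)
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }
}
```

With the column values collected into a local sequence (for example a hypothetical scores: Seq[Double]), PercentileCheck.exactPercentile(scores, 0.77) gives a baseline to compare against the output of percentile_approx at each accuracy setting.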