Posted to issues@spark.apache.org by "James Verbus (Jira)" <ji...@apache.org> on 2019/10/02 05:06:00 UTC
[jira] [Created] (SPARK-29325) approxQuantile() results are incorrect and vary significantly for small changes in relativeError
James Verbus created SPARK-29325:
------------------------------------
Summary: approxQuantile() results are incorrect and vary significantly for small changes in relativeError
Key: SPARK-29325
URL: https://issues.apache.org/jira/browse/SPARK-29325
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4, 2.3.2
Environment: I was using OSX 10.14.6.
I was using Scala 2.11.12 and Spark 2.4.4.
I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
Reporter: James Verbus
Attachments: 20191001_example_data_approx_quantile_bug.zip
The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] sometimes returns incorrect results that depend sensitively on the choice of relativeError.
Below is an example using the latest Spark version (2.4.4). The result varies significantly for modest changes in the relativeError parameter, and by much more than the magnitude of relativeError itself.
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]
scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)
scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)
scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)
scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)
scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)
scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544)
{code}
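One way to judge whether a returned value is actually outside the documented tolerance is to check its rank rather than its value. The approxQuantile() API documentation states that, for N elements, probability p and error err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below is only an illustration of that check in plain Scala. The object QuantileRankCheck and its methods rankOf and withinBound are hypothetical helpers, not part of Spark, and the rank is assumed here to be the count of values less than or equal to x.

```scala
// Hypothetical helper (not part of Spark): checks the rank bound that the
// approxQuantile() documentation describes for a returned sample x.
object QuantileRankCheck {

  /** Rank of x, taken here as the number of values <= x (an assumption). */
  def rankOf(values: Array[Double], x: Double): Long =
    values.count(_ <= x).toLong

  /** True if floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). */
  def withinBound(values: Array[Double], p: Double, err: Double, x: Double): Boolean = {
    val n = values.length
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    val r = rankOf(values, x)
    lo <= r && r <= hi
  }
}
```

For the session above, the column would first have to be collected to the driver as an Array[Double], which is only practical for small datasets. Each result (for example res1 with relativeError 0.001) could then be passed as x together with p = 0.9 and the matching err. A result that fails the check would indicate a violation of the documented guarantee rather than an expected approximation error.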