Posted to issues@spark.apache.org by "James Verbus (Jira)" <ji...@apache.org> on 2019/10/02 05:06:00 UTC

[jira] [Updated] (SPARK-29325) approxQuantile() results are incorrect and vary significantly for small changes in relativeError

     [ https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Verbus updated SPARK-29325:
---------------------------------
    Attachment: 20191001_example_data_approx_quantile_bug.zip

> approxQuantile() results are incorrect and vary significantly for small changes in relativeError
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29325
>                 URL: https://issues.apache.org/jira/browse/SPARK-29325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.4
>         Environment: I was using OSX 10.14.6.
> I was using Scala 2.11.12 and Spark 2.4.4.
> I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
>            Reporter: James Verbus
>            Priority: Major
>              Labels: correctness
>         Attachments: 20191001_example_data_approx_quantile_bug.zip
>
>
> The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] sometimes returns incorrect results that are highly sensitive to the choice of relativeError.
> Below is an example in the latest Spark version (2.4.4). The result for the 0.9 quantile varies significantly for modest changes in the specified relativeError parameter, by much more than the magnitude of relativeError itself.
>  
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>          
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
> df: org.apache.spark.sql.DataFrame = [value: double]
> scala> df.stat.approxQuantile("value", Array(0.9), 0)
> res0: Array[Double] = Array(0.5929591082174609)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
> res1: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
> res2: Array[Double] = Array(0.5926195654486178)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
> res3: Array[Double] = Array(0.5924693999048418)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
> res4: Array[Double] = Array(0.67621027121925)
> scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
> res5: Array[Double] = Array(0.5923925937051544)
>  
> {code}
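
One way to tell which of the results above are actually out of bounds is to compare each returned value against an exact count over the data. The sketch below assumes the rank bound stated in the approxQuantile() documentation: for N rows, probability p and relativeError err, the returned sample x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The object QuantileRankCheck and the method withinGuarantee are hypothetical names introduced here for illustration; they are not part of Spark, and the check assumes the column fits in driver memory.

```scala
// Hypothetical helper (not part of Spark): tests the documented rank bound
// against an exact count over the full data set.
object QuantileRankCheck {

  /**
   * @param data     all non-null values of the column, in any order
   * @param p        requested probability, e.g. 0.9
   * @param err      the relativeError that was passed to approxQuantile
   * @param returned the value approxQuantile returned for p
   * @return true if some exact rank of `returned` lies inside the allowed window
   */
  def withinGuarantee(data: Seq[Double], p: Double, err: Double, returned: Double): Boolean = {
    val n = data.length
    val lowestAllowed = math.floor((p - err) * n)
    val highestAllowed = math.ceil((p + err) * n)

    // With ties, `returned` occupies a range of 1-based ranks.
    val minRank = data.count(_ < returned) + 1
    val maxRank = data.count(_ <= returned)

    // Accept if any rank held by `returned` falls inside the allowed window.
    maxRank >= lowestAllowed && minRank <= highestAllowed
  }
}
```

In spark-shell, the values could be collected with df.select("value").as[Double].collect() and passed in together with, for example, res1(0) and the relativeError of 0.001 used for that call. A result of false would indicate that the returned value violates the documented bound rather than merely differing from the neighbouring results.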



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
