
Posted to issues@spark.apache.org by "James Verbus (Jira)" <ji...@apache.org> on 2019/10/02 05:06:00 UTC

[jira] [Created] (SPARK-29325) approxQuantile() results are incorrect and vary significantly for small changes in relativeError

James Verbus created SPARK-29325:
------------------------------------

             Summary: approxQuantile() results are incorrect and vary significantly for small changes in relativeError
                 Key: SPARK-29325
                 URL: https://issues.apache.org/jira/browse/SPARK-29325
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.4, 2.3.2
         Environment: I was using OSX 10.14.6.

I was using Scala 2.11.12 and Spark 2.4.4.

I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2.
            Reporter: James Verbus
         Attachments: 20191001_example_data_approx_quantile_bug.zip

The [approxQuantile() method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] sometimes returns incorrect results that are highly sensitive to the choice of relativeError.

Below is an example using the latest Spark version (2.4.4). The returned 0.9 quantile jumps between roughly 0.592 and 0.676 as relativeError changes in steps of 0.001, a variation far larger than the relativeError values themselves.

 
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.


scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]


scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)


scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)


scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)


scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)


scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)


scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544)
 
{code}
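
One way to check whether a returned value is actually outside the documented tolerance is to test it against the rank bound stated in the approxQuantile() Scaladoc: for N elements, probability p and error err, the result x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is only a minimal sketch of that check. The names QuantileCheck, exactQuantile and rankWithinBound are hypothetical helpers made up for illustration, not Spark APIs, and the sketch assumes the column is small enough to hold on the driver as an Array[Double].

```scala
// Hypothetical helper, not part of Spark: checks the rank bound documented
// for approxQuantile() against an in-memory copy of the column.
object QuantileCheck {

  /** Exact quantile by sorting: the value at rank ceil(p * n), 1-based. */
  def exactQuantile(values: Array[Double], p: Double): Double = {
    require(values.nonEmpty, "values must be non-empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.sorted
    // Clamp to 1 so that p = 0 maps to the minimum element.
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  /** True if rank(x) lies in [floor((p - err) * n), ceil((p + err) * n)]. */
  def rankWithinBound(values: Array[Double], x: Double, p: Double, err: Double): Boolean = {
    val n = values.length
    // Assumption: rank(x) is taken as the number of elements <= x.
    val rank = values.count(_ <= x)
    val lo = math.floor((p - err) * n).toLong
    val hi = math.ceil((p + err) * n).toLong
    rank >= lo && rank <= hi
  }
}
```

In spark-shell, the input array could be obtained with something like df.select("value").as[Double].collect(), assuming the column contains no nulls. Each value returned by approxQuantile() can then be passed to rankWithinBound with the same p and relativeError to see whether it satisfies the documented bound.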



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
