Posted to issues@spark.apache.org by "Maxim Gekk (Jira)" <ji...@apache.org> on 2020/09/21 19:02:00 UTC

[jira] [Commented] (SPARK-30882) Inaccurate results with higher precision ApproximatePercentile results

    [ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199602#comment-17199602 ] 

Maxim Gekk commented on SPARK-30882:
------------------------------------

This PR [https://github.com/apache/spark/pull/29784] solves the issue:
{code}
+----------+--------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 10000)|
+----------+--------------------------------------------------------+
|metric    |[450000, 1124848, 3374848, 4050000]                     |
+----------+--------------------------------------------------------+

+----------+----------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1000000)|
+----------+----------------------------------------------------------+
|metric    |[450000, 1124996, 3374996, 4050000]                       |
+----------+----------------------------------------------------------+

+----------+----------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 5000000)|
+----------+----------------------------------------------------------+
|metric    |[450000, 1125000, 3375000, 4050000]                       |
+----------+----------------------------------------------------------+

+----------+----------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 7000000)|
+----------+----------------------------------------------------------+
|metric    |[450000, 1125000, 3375000, 4050000]                       |
+----------+----------------------------------------------------------+

+----------+-----------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 10000000)|
+----------+-----------------------------------------------------------+
|metric    |[450000, 1125000, 3375000, 4050000]                        |
+----------+-----------------------------------------------------------+
{code}

I propose to close this bug report. cc [~hyukjin.kwon] [~cloud_fan] [~srowen] [~dongjoon]
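For reference, the high-accuracy results above match the exact percentiles: the test data is the integers 1..4500000, so the exact p-quantile is approximately p * n. A quick sanity check (plain Python arithmetic, independent of Spark):

```python
# Exact percentiles of the test data 1..4_500_000: for the integers
# 1..n, the p-quantile is approximately p * n. These are the values
# the fixed percentile_approx output converges to at high accuracy.
n = 4_500_000
percentiles = [0.1, 0.25, 0.75, 0.9]
exact = [round(p * n) for p in percentiles]
print(exact)  # [450000, 1125000, 3375000, 4050000]
```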

> Inaccurate results with higher precision ApproximatePercentile results 
> -----------------------------------------------------------------------
>
>                 Key: SPARK-30882
>                 URL: https://issues.apache.org/jira/browse/SPARK-30882
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.4
>            Reporter: Nandini malempati
>            Priority: Major
>
> Results of ApproximatePercentile should get more accurate as the accuracy parameter increases, per the documentation provided here:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior. On a data set of 4500000 rows, accuracy 1000000 returns better results than 5000000 (P25 and P90 look fine, but P75 is very off), while accuracy 7000000 gives exact results. This behavior is not consistent.
> {code:scala}
> // Reproduces SPARK-30882: compare percentile_approx results at several accuracy settings
> package com.microsoft.teams.test.utils
> import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
> import org.scalatest.{Assertions, FunSpec}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> import scala.collection.mutable.ListBuffer
> class PercentilesTest extends FunSpec with Assertions {
>   it("check percentiles with different precision") {
>     val schema = List(StructField("MetricName", StringType), StructField("DataPoint", IntegerType))
>     val data = new ListBuffer[Row]
>     for (i <- 1 to 4500000) { data += Row("metric", i) }
>     import spark.implicits._
>     val df = createDF(schema, data.toSeq)
>     val percentages = typedLit(Seq(0.1, 0.25, 0.75, 0.9))
>     // DEFAULT_PERCENTILE_ACCURACY is 10000
>     val accuracy10K = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
>     val accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 1000000))
>     val accuracy5M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 5000000))
>     val accuracy7M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 7000000))
>     val accuracy10M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint", percentages, 10000000))
>     accuracy10K.show(1, false)
>     accuracy1M.show(1, false)
>     accuracy5M.show(1, false)
>     accuracy7M.show(1, false)
>     accuracy10M.show(1, false)
>   }
>   // Spark 2.4 has no percentile_approx Column function, so build the aggregate expression directly
>   def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
>     val expr = new ApproximatePercentile(col.expr, percentage.expr, accuracy.expr).toAggregateExpression
>     new Column(expr)
>   }
>   def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
>     percentile_approx(col, percentage, lit(accuracy))
>   lazy val spark: SparkSession = {
>     SparkSession
>       .builder()
>       .master("local")
>       .appName("spark tests")
>       .getOrCreate()
>   }
>   def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = {
>     spark.createDataFrame(
>       spark.sparkContext.parallelize(data),
>       StructType(schema))
>   }
> }
> {code}
> Above is a test that reproduces the error. Below are a few runs with different accuracies. P25 and P90 look fine, but P75 comes back as the max of the column.
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 7000000)|
> +----------+----------------------------------------------------------+
> |metric    |[450000, 1125000, 3375000, 4050000]                       |
> +----------+----------------------------------------------------------+
> 
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 5000000)|
> +----------+----------------------------------------------------------+
> |metric    |[450000, 1125000, 4500000, 4050000]                       |
> +----------+----------------------------------------------------------+
> 
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1000000)|
> +----------+----------------------------------------------------------+
> |metric    |[450000, 1124998, 3374996, 4050000]                       |
> +----------+----------------------------------------------------------+
> 
> +----------+--------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 10000)|
> +----------+--------------------------------------------------------+
> |metric    |[450000, 1124848, 3374638, 4050000]                     |
> +----------+--------------------------------------------------------+
> This is breaking our reports, as there is no precise definition of accuracy. We have data sets of more than 27000000 rows. After studying the pattern, we found that inaccurate percentiles always come back as the max of the column. P50 and P99 might be right in a few cases, but P75 can be wrong.
> Is there a way to determine the correct accuracy for a given dataset size?
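On sizing the accuracy parameter: the Spark documentation for percentile_approx describes accuracy as trading memory for precision, with 1.0/accuracy being the relative error of the approximation. Read as a rank-error bound, that gives roughly n/accuracy as the maximum rank error. A back-of-envelope check against the runs above (plain Python; max_rank_error is a hypothetical helper, not a Spark API):

```python
import math

def max_rank_error(n: int, accuracy: int) -> int:
    """Rough upper bound on rank error, reading 1.0/accuracy as the
    documented relative error of percentile_approx."""
    return math.ceil(n / accuracy)

n = 4_500_000
# accuracy=10000: bound is 450 ranks; observed P25 was off by 152 (1124848 vs 1125000)
print(max_rank_error(n, 10_000))     # 450
# accuracy=1000000: bound is 5 ranks; observed P25 was off by 4 (1124996 vs 1125000)
print(max_rank_error(n, 1_000_000))  # 5
```

By this reading, choosing accuracy >= n / acceptable_rank_error would bound the error for a given dataset size, which is consistent with the corrected outputs above.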



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org