Posted to issues@spark.apache.org by "Liwei Lin (JIRA)" <ji...@apache.org> on 2016/07/13 04:02:20 UTC

[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

    [ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374295#comment-15374295 ] 

Liwei Lin edited comment on SPARK-16283 at 7/13/16 4:01 AM:
------------------------------------------------------------

Hive's percentile_approx implementation computes approximate percentile values from a histogram (please refer to [Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java] and [Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java] for details):
- the signature of Hive's percentile_approx is: {{\_FUNC\_(expr, pc, \[nb\])}}
- the optional parameter \[nb\] sets the number of histogram bins to use
- if the number of unique values in the actual dataset is less than or equal to \[nb\], we can expect an exact result; otherwise there are no approximation guarantees
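
To illustrate the exactness point above, here is a minimal Python sketch of a fixed-bin histogram. It is not Hive's NumericHistogram code: the class name BinnedHistogram, the merge rule (merge the two closest adjacent bins into their weighted average) and the lookup without interpolation are assumptions made purely for illustration.

```python
import bisect


class BinnedHistogram:
    """Hypothetical fixed-size histogram: at most nb [value, count] bins, sorted by value."""

    def __init__(self, nb):
        if nb < 1:
            raise ValueError("nb must be at least 1")
        self.nb = nb
        self.bins = []  # sorted list of [value, count]

    def add(self, x):
        values = [b[0] for b in self.bins]
        i = bisect.bisect_left(values, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1  # value already has its own bin
            return
        self.bins.insert(i, [x, 1])
        if len(self.bins) > self.nb:
            self._merge_closest()

    def _merge_closest(self):
        # Assumed merge rule: collapse the adjacent pair with the smallest gap.
        j = min(range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
        (v1, c1), (v2, c2) = self.bins[j], self.bins[j + 1]
        self.bins[j:j + 2] = [[(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]]

    def percentile(self, pc):
        if not self.bins:
            raise ValueError("empty histogram")
        total = sum(c for _, c in self.bins)
        target = pc * total
        cum = 0
        for v, c in self.bins:
            cum += c
            if cum >= target:
                return v  # first bin whose cumulative count reaches the target
        return self.bins[-1][0]


data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]  # 4 distinct values

exact = BinnedHistogram(nb=4)   # nb >= number of distinct values: no merging
lossy = BinnedHistogram(nb=2)   # nb < number of distinct values: bins get merged
for x in data:
    exact.add(x)
    lossy.add(x)
```

When nb is at least the number of distinct values, every bin holds one distinct value with its exact count, so the lookup returns a value taken from the data. Once bins are merged, bin centers are averages that need not occur in the data, and nothing in this sketch bounds how far the answer can drift.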


Our Dataset's approxQuantile() implementation is not really histogram-based (and thus differs from Hive's implementation):
- the signature of our Dataset's approxQuantile() is something like: {{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our approximation is deterministically bounded by this relativeError -- please refer to [Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39] for details
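
As a rough illustration of what "bounded by relativeError" means, the Python sketch below checks a rank bound of the form floor((p - err) * N) <= rank(x) <= ceil((p + err) * N), which is my reading of the Scaladoc linked above. The function name within_relative_error is hypothetical, and the sketch only verifies the bound for a candidate answer; it does not implement Spark's algorithm.

```python
import math


def within_relative_error(data, x, p, relative_error):
    """Return True if x is an element of data whose rank lies within the assumed bound."""
    n = len(data)
    lo_rank = sum(1 for v in data if v < x) + 1  # smallest 1-based rank x can occupy
    hi_rank = sum(1 for v in data if v <= x)     # largest rank x can occupy (duplicates)
    if hi_rank < lo_rank:
        return False  # x does not occur in data
    lower = math.floor((p - relative_error) * n)
    upper = math.ceil((p + relative_error) * n)
    # Accept x if any rank it occupies falls inside [lower, upper].
    return lo_rank <= upper and hi_rank >= lower


data = list(range(1, 101))  # 100 values, so rank(x) == x
```

With relativeError = 0 the window shrinks to the exact rank, and a larger relativeError widens the set of acceptable answers in a predictable way. The bin count \[nb\] offers no comparable statement once the data has more than \[nb\] distinct values.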


Since there's no direct deterministic relationship between \[nb\] and relativeError, it seems hard to build Hive's percentile_approx on top of our Dataset's approxQuantile(). So, [~rxin], [~thunterdb], should we: (a) port Hive's implementation into Spark and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or (b) provide {{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's approxQuantile() implementation, which might be incompatible with Hive? Could you share some thoughts? Thanks!



> Implement percentile_approx SQL function
> ----------------------------------------
>
>                 Key: SPARK-16283
>                 URL: https://issues.apache.org/jira/browse/SPARK-16283
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>



