Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:02:21 UTC
[jira] [Updated] (SPARK-18940) Percentile and approximate percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-18940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-18940:
---------------------------------
Labels: bulk-closed (was: )
> Percentile and approximate percentile support for frequency distribution table
> ------------------------------------------------------------------------------
>
> Key: SPARK-18940
> URL: https://issues.apache.org/jira/browse/SPARK-18940
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: gagan taneja
> Priority: Major
> Labels: bulk-closed
>
> I have a frequency distribution table with the following entries:
> {noformat}
> Age, No of person
> 21, 10
> 22, 15
> 23, 18
> ..
> ..
> 30, 14
> {noformat}
> It is common to have data in frequency distribution format and to need statistics such as a percentile or the median from it. With the current implementation, computing a percentile from such a table is difficult and complex.
> Therefore, I am proposing an enhancement to the current Percentile and Approx Percentile implementations so that they take a frequency distribution column into consideration.
> Current Percentile definition
> {noformat}
> percentile(col, array(percentage1 [, percentage2]...))
> case class Percentile(
>     child: Expression,
>     percentageExpression: Expression,
>     mutableAggBufferOffset: Int = 0,
>     inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, percentageExpression, 0, 0)
>   }
> }
> {noformat}
> Proposed changes
> {noformat}
> percentile(col, [frequency], array(percentage1 [, percentage2]...))
> case class Percentile(
>     child: Expression,
>     frequency: Expression,
>     percentageExpression: Expression,
>     mutableAggBufferOffset: Int = 0,
>     inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, Literal(1L), percentageExpression, 0, 0)
>   }
>   def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
>     this(child, frequency, percentageExpression, 0, 0)
>   }
> }
> {noformat}
> Although this definition would differ from the Hive implementation, it would be useful functionality for many Spark users.
> Moreover, the changes are local to the Percentile and ApproxPercentile implementations.
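
To illustrate the semantics the description asks for, here is a minimal sketch in plain Scala (no Spark dependency) of an exact percentile computed directly from (value, frequency) pairs. The object and method names (`FrequencyPercentile`, `percentile`, `percentileExpanded`) are hypothetical and are not part of Spark. The interpolation rule (linear interpolation between the two nearest ranks) is an assumption and may not match what Spark's Percentile does in every detail. `percentileExpanded` shows the workaround of repeating each value by its frequency, which the weighted version is meant to avoid.

```scala
object FrequencyPercentile {

  /** Exact percentile over (value, frequency) pairs.
    * Hypothetical helper for illustration; not Spark's implementation.
    * Assumes linear interpolation between the two nearest ranks. */
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    val rows = table.filter(_._2 > 0).sortBy(_._1)
    require(rows.nonEmpty, "need at least one row with a positive frequency")

    // Running totals: cumulative(i) counts the observations <= rows(i)._1.
    val cumulative = rows.scanLeft(0L)(_ + _._2).tail
    val total = cumulative.last

    // Zero-based fractional rank in the (virtual) expanded, sorted data.
    val position = (total - 1) * p
    val lower = math.floor(position).toLong
    val upper = math.ceil(position).toLong

    // Value at a zero-based rank: first row whose running total exceeds it.
    def valueAt(rank: Long): Double = rows(cumulative.indexWhere(_ > rank))._1

    val lo = valueAt(lower)
    val hi = valueAt(upper)
    lo + (position - lower) * (hi - lo)
  }

  /** Reference version: repeat each value `frequency` times, then apply the
    * same rank rule to the raw data. Only practical for small tables. */
  def percentileExpanded(table: Seq[(Double, Long)], p: Double): Double = {
    val data = table.flatMap { case (v, f) => Seq.fill(f.toInt)(v) }.sorted
    require(data.nonEmpty, "need at least one observation")
    val position = (data.length - 1) * p
    val lower = math.floor(position).toInt
    val upper = math.ceil(position).toInt
    data(lower) + (position - lower) * (data(upper) - data(lower))
  }
}
```

Using only the four rows that are visible in the example table (the elided rows are left out, which is an assumption made for this illustration), `FrequencyPercentile.percentile(Seq((21.0, 10L), (22.0, 15L), (23.0, 18L), (30.0, 14L)), 0.5)` returns 23.0, the same value the expanded version gives for the 57 repeated observations.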