Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:02:21 UTC

[jira] [Updated] (SPARK-18940) Percentile and approximate percentile support for frequency distribution table

     [ https://issues.apache.org/jira/browse/SPARK-18940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-18940:
---------------------------------
    Labels: bulk-closed  (was: )

> Percentile and approximate percentile support for frequency distribution table
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18940
>                 URL: https://issues.apache.org/jira/browse/SPARK-18940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: gagan taneja
>            Priority: Major
>              Labels: bulk-closed
>
> I have a frequency distribution table with the following entries:
> {noformat}
> Age, No. of persons
> 21, 10
> 22, 15
> 23, 18 
> ..
> ..
> 30, 14
> {noformat}
> It is common to have data in frequency distribution format and to need the percentile or median of it. With the current implementation,
> this is difficult and complex, because every input row counts as a single observation.
> Therefore I am proposing an enhancement to the current Percentile and Approx Percentile implementations to take a frequency distribution column into consideration.
> Current Percentile definition:
> {noformat}
> percentile(col, array(percentage1 [, percentage2]...))
> case class Percentile(
>   child: Expression,
>   percentageExpression: Expression,
>   mutableAggBufferOffset: Int = 0,
>   inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, percentageExpression, 0, 0)
>   }
> }
> {noformat}
> Proposed changes:
> {noformat}
> percentile(col, [frequency], array(percentage1 [, percentage2]...))
> case class Percentile(
>   child: Expression,
>   frequency: Expression,
>   percentageExpression: Expression,
>   mutableAggBufferOffset: Int = 0,
>   inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, Literal(1L), percentageExpression, 0, 0)
>   }
>   def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
>     this(child, frequency, percentageExpression, 0, 0)
>   }
> }
> {noformat}
> Although this definition will differ from the Hive implementation, it will be useful functionality for many Spark users.
> Moreover, the changes are local to the Percentile and ApproxPercentile implementations.
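
To make the requested semantics concrete, here is a minimal plain-Scala sketch of an exact percentile over (value, frequency) pairs. It is not Spark code and not the proposed patch: the names `FrequencyPercentile` and `percentile` are hypothetical, and it assumes the result should match what you would get by repeating each value `frequency` times, sorting, and interpolating linearly between neighbouring ranks.

```scala
object FrequencyPercentile {

  /** Exact percentile of the data described by (value, frequency) pairs.
    *
    * Hypothetical helper for illustration only. `p` is a fraction in [0, 1].
    * Returns None when the table holds no observations.
    */
  def percentile(table: Seq[(Double, Long)], p: Double): Option[Double] = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    require(table.forall(_._2 >= 0L), "frequencies must be non-negative")

    // Drop empty buckets and order the remaining ones by value.
    val sorted = table.filter(_._2 > 0L).sortBy(_._1)
    if (sorted.isEmpty) {
      None
    } else {
      // cumulative(i) = number of observations with value <= sorted(i)._1
      val cumulative = sorted.map(_._2).scanLeft(0L)(_ + _).tail
      val total = cumulative.last

      // Value found at a zero-based rank of the (never materialised) expanded data.
      def valueAt(rank: Long): Double =
        sorted(cumulative.indexWhere(_ > rank))._1

      // Assumed rule: fractional rank (total - 1) * p, linear interpolation.
      val position = (total - 1) * p
      val lower = math.floor(position).toLong
      val upper = math.ceil(position).toLong
      val lo = valueAt(lower)
      val hi = valueAt(upper)
      Some(lo + (hi - lo) * (position - lower))
    }
  }
}
```

Using only the three rows that are fully visible in the table above, `FrequencyPercentile.percentile(Seq((21.0, 10L), (22.0, 15L), (23.0, 18L)), 0.5)` returns `Some(22.0)`: there are 43 observations, the median sits at rank 21, and ranks 10 through 24 all hold the value 22. The frequency column only shifts the cumulative counts, so no row expansion is needed.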



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
