Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/12/20 11:31:58 UTC

[jira] [Commented] (SPARK-18940) Percentile and approximate percentile support for frequency distribution table

    [ https://issues.apache.org/jira/browse/SPARK-18940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763987#comment-15763987 ] 

Herman van Hovell commented on SPARK-18940:
-------------------------------------------

I like this idea.

We can maintain Hive compatibility by adding the {{frequency}} argument at the end of the function. You would also have to add an appropriate constructor.
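
A minimal sketch of what that could look like, assuming the frequency is simply appended after the percentage argument. The {{Expression}}, {{Column}} and {{Literal}} types below are stand-ins defined only for this example, not Spark's Catalyst classes, and the field name {{frequencyExpression}} is hypothetical:

```scala
// Stand-in types for illustration only; these are NOT Spark's Catalyst classes.
sealed trait Expression
final case class Column(name: String) extends Expression
final case class Literal(value: Any) extends Expression

// Hypothetical shape: the frequency comes after the percentage argument,
// so the existing two-argument (Hive-style) call keeps its meaning.
case class Percentile(
    child: Expression,
    percentageExpression: Expression,
    frequencyExpression: Expression,
    mutableAggBufferOffset: Int,
    inputAggBufferOffset: Int) {

  // percentile(col, percentage): every row counts exactly once.
  def this(child: Expression, percentageExpression: Expression) =
    this(child, percentageExpression, Literal(1L), 0, 0)

  // percentile(col, percentage, frequency): each row counts `frequency` times.
  def this(child: Expression, percentageExpression: Expression, frequency: Expression) =
    this(child, percentageExpression, frequency, 0, 0)
}
```

With this layout, {{new Percentile(Column("age"), Literal(0.5))}} defaults the frequency to {{Literal(1L)}}, while the three-argument constructor takes the frequency column as its last argument.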

> Percentile and approximate percentile support for frequency distribution table
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18940
>                 URL: https://issues.apache.org/jira/browse/SPARK-18940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: gagan taneja
>
> I have a frequency distribution table with the following entries:
> Age, No. of persons
> 21, 10
> 22, 15
> 23, 18 
> ..
> ..
> 30, 14
> It is common to have data in frequency distribution format and to want to calculate percentiles or the median from it. With the current implementation,
> finding the percentile of such data is difficult and complex.
> I am therefore proposing an enhancement to the current Percentile and Approx Percentile implementations that takes a frequency distribution column into account.
> Current Percentile definition 
> percentile(col, array(percentage1 [, percentage2]...))
> case class Percentile(
>   child: Expression,
>   percentageExpression: Expression,
>   mutableAggBufferOffset: Int = 0,
>   inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, percentageExpression, 0, 0)
>   }
> }
> Proposed changes 
> percentile(col, [frequency], array(percentage1 [, percentage2]...))
> case class Percentile(
>   child: Expression,
>   frequency: Expression,
>   percentageExpression: Expression,
>   mutableAggBufferOffset: Int = 0,
>   inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, Literal(1L), percentageExpression, 0, 0)
>   }
>   def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
>     this(child, frequency, percentageExpression, 0, 0)
>   }
> }
> Although this definition will differ from the Hive implementation, it will be useful functionality for many Spark users.
> Moreover, the changes are local to the Percentile and ApproxPercentile implementations.
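
To make the request in the quoted description concrete, here is a small sketch of a percentile computed directly from (value, frequency) pairs. It assumes a simple nearest-rank definition (the smallest value whose cumulative frequency reaches {{ceil(p * total)}}), which is not necessarily the definition Spark or Hive uses. The object and method names are hypothetical, and the rows elided with ".." in the description are left out rather than guessed:

```scala
object FrequencyPercentile {
  // Nearest-rank percentile over (value, frequency) pairs.
  // Assumption: this is an illustrative definition, not Spark's algorithm.
  def percentile(table: Seq[(Int, Long)], p: Double): Int = {
    require(table.nonEmpty, "table must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    require(table.forall(_._2 > 0), "frequencies must be positive")

    val sorted = table.sortBy(_._1)
    val total = sorted.map(_._2).sum

    // 1-based rank of the target element; at least 1 so that p = 0 gives the minimum.
    val rank = math.max(1L, math.ceil(p * total).toLong)

    // cumulative(i) = number of observations with value <= sorted(i)._1
    val cumulative = sorted.scanLeft(0L)(_ + _._2).tail

    sorted.zip(cumulative)
      .collectFirst { case ((value, _), cum) if cum >= rank => value }
      .get // safe: the last cumulative count equals total, and rank <= total
  }
}
```

Using only the four rows spelled out in the description, {{FrequencyPercentile.percentile(Seq((21, 10L), (22, 15L), (23, 18L), (30, 14L)), 0.5)}} walks the cumulative counts 10, 25, 43, 57 and stops at the first one that reaches rank 29, which belongs to age 23. No expansion of the table into one row per person is needed.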


