Posted to issues@spark.apache.org by "gagan taneja (JIRA)" <ji...@apache.org> on 2016/12/20 08:11:58 UTC

[jira] [Created] (SPARK-18940) Percentile and approximate percentile support for frequency distribution table

gagan taneja created SPARK-18940:
------------------------------------

             Summary: Percentile and approximate percentile support for frequency distribution table
                 Key: SPARK-18940
                 URL: https://issues.apache.org/jira/browse/SPARK-18940
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: gagan taneja


I have a frequency distribution table with the following entries:
Age, No. of persons
21, 10
22, 15
23, 18 
..
..
30, 14

It is common to have data in frequency distribution format and to need the percentile or median computed from it. With the current implementation,
finding the percentile of such data is difficult and complex.
Therefore I am proposing an enhancement to the current Percentile and Approx Percentile implementations to take a frequency distribution column into consideration.
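To illustrate the difficulty, here is a minimal plain-Scala sketch of the workaround a caller needs when there is no frequency argument: repeat every value as many times as its count, then take an ordinary percentile. The object and method names are hypothetical, no Spark API is used, and the linear-interpolation rule between the two closest ranks is an assumption about how the exact percentile is defined.

```scala
object ExpandedPercentile {
  // Hypothetical helper: exact percentile over raw values. Assumes linear
  // interpolation between the two closest ranks of the sorted data.
  def percentile(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    val sorted = values.sorted.toIndexedSeq
    val position = p * (sorted.size - 1)
    val lower = math.floor(position).toInt
    val higher = math.ceil(position).toInt
    if (lower == higher) sorted(lower)
    else (higher - position) * sorted(lower) + (position - lower) * sorted(higher)
  }

  // Workaround: materialize one element per counted person. Memory and time
  // grow with the sum of the counts, not with the number of table rows.
  def expand(table: Seq[(Double, Long)]): Seq[Double] =
    table.flatMap { case (value, count) => Seq.fill(count.toInt)(value) }
}
```

For only the three fully listed rows, `ExpandedPercentile.percentile(ExpandedPercentile.expand(Seq((21.0, 10L), (22.0, 15L), (23.0, 18L))), 0.5)` yields 22.0, but it has to build 43 elements to get there.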
Current Percentile definition 

percentile(col, array(percentage1 [, percentage2]...))
case class Percentile(
  child: Expression,
  percentageExpression: Expression,
  mutableAggBufferOffset: Int = 0,
  inputAggBufferOffset: Int = 0) {
  def this(child: Expression, percentageExpression: Expression) = {
    this(child, percentageExpression, 0, 0)
  }
}

Proposed changes 

percentile(col, [frequency], array(percentage1 [, percentage2]...))
case class Percentile(
  child: Expression,
  frequency: Expression,
  percentageExpression: Expression,
  mutableAggBufferOffset: Int = 0,
  inputAggBufferOffset: Int = 0) {
  def this(child: Expression, percentageExpression: Expression) = {
    this(child, Literal(1L), percentageExpression, 0, 0)
  }
  def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
    this(child, frequency, percentageExpression, 0, 0)
  }
}
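The semantics the frequency argument could have can be sketched in plain Scala, outside of Spark. This is only an illustration under stated assumptions, not the proposed implementation: `WeightedPercentile` and its method are hypothetical names, and the result is assumed to equal the exact percentile (linear interpolation between closest ranks) of the table expanded row by row. Cumulative counts locate each rank, so nothing is materialized.

```scala
object WeightedPercentile {
  // Hypothetical helper, not Spark's code: exact percentile of a frequency
  // table given as (value, frequency) pairs.
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    // Ignore rows with non-positive frequency and order the rest by value.
    val sorted = table.filter(_._2 > 0).sortBy(_._1).toIndexedSeq
    require(sorted.nonEmpty, "table needs at least one positive frequency")
    // cumulative(i) = number of virtual rows with value <= sorted(i)._1
    val cumulative = sorted.scanLeft(0L)(_ + _._2).tail
    val total = cumulative.last
    // Value found at a zero-based rank of the virtual, expanded sorted data.
    def valueAt(rank: Long): Double =
      sorted(cumulative.indexWhere(_ > rank))._1
    val position = p * (total - 1)
    val lower = math.floor(position).toLong
    val higher = math.ceil(position).toLong
    if (lower == higher) valueAt(lower)
    else (higher - position) * valueAt(lower) + (position - lower) * valueAt(higher)
  }
}
```

With a frequency of 1 on every row this reduces to the ordinary percentile, which matches the `Literal(1L)` default in the two-argument constructor above.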

Although this definition will differ from the Hive implementation, it will be useful functionality for many Spark users.
Moreover, the changes are local to the Percentile and ApproxPercentile implementations.



