Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/12/20 11:31:58 UTC
[jira] [Commented] (SPARK-18940) Percentile and approximate
percentile support for frequency distribution table
[ https://issues.apache.org/jira/browse/SPARK-18940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763987#comment-15763987 ]
Herman van Hovell commented on SPARK-18940:
-------------------------------------------
I like this idea.
We can maintain Hive compatibility by adding the {{frequency}} argument at the end of the function. You would also have to add an appropriate constructor.
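A rough sketch of that argument ordering follows. It is not the actual Spark class: the Expression, Literal and AttributeRef types are hypothetical stand-ins for Catalyst's, and the field name frequencyExpression is an assumption. With the frequency last, the existing two-argument Hive form keeps working and defaults to a frequency of 1 per row.

```scala
// Hypothetical stand-ins for Catalyst expression types, so the sketch compiles on its own.
sealed trait Expression
case class Literal(value: Any) extends Expression
case class AttributeRef(name: String) extends Expression

// Sketch only: frequency is placed AFTER the percentage argument.
case class Percentile(
    child: Expression,
    percentageExpression: Expression,
    frequencyExpression: Expression, // assumed field name
    mutableAggBufferOffset: Int = 0,
    inputAggBufferOffset: Int = 0) {

  // Hive-compatible form: percentile(col, percentage); every row counts once.
  def this(child: Expression, percentageExpression: Expression) =
    this(child, percentageExpression, Literal(1L), 0, 0)

  // Extended form: percentile(col, percentage, frequency).
  def this(child: Expression, percentageExpression: Expression, frequency: Expression) =
    this(child, percentageExpression, frequency, 0, 0)
}
```

For example, new Percentile(AttributeRef("age"), Literal(0.5)) would carry Literal(1L) as its frequency, while a third argument such as AttributeRef("persons") would supply a per-row count.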
> Percentile and approximate percentile support for frequency distribution table
> ------------------------------------------------------------------------------
>
> Key: SPARK-18940
> URL: https://issues.apache.org/jira/browse/SPARK-18940
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: gagan taneja
>
> I have a frequency distribution table with the following entries:
> Age, No. of persons
> 21, 10
> 22, 15
> 23, 18
> ..
> ..
> 30, 14
> It is common to have data in frequency distribution format and to need the percentile or median computed from it.
> With the current implementation, this is difficult and complex to do.
> Therefore I am proposing an enhancement to the current Percentile and Approx Percentile implementations to take a frequency distribution column into consideration.
> Current Percentile definition
> percentile(col, array(percentage1 [, percentage2]...))
> case class Percentile(
>     child: Expression,
>     percentageExpression: Expression,
>     mutableAggBufferOffset: Int = 0,
>     inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, percentageExpression, 0, 0)
>   }
> }
> Proposed changes
> percentile(col, [frequency], array(percentage1 [, percentage2]...))
> case class Percentile(
>     child: Expression,
>     frequency: Expression,
>     percentageExpression: Expression,
>     mutableAggBufferOffset: Int = 0,
>     inputAggBufferOffset: Int = 0) {
>   def this(child: Expression, percentageExpression: Expression) = {
>     this(child, Literal(1L), percentageExpression, 0, 0)
>   }
>   def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
>     this(child, frequency, percentageExpression, 0, 0)
>   }
> }
> Although this definition will differ from the Hive implementation, it will be useful functionality for many Spark users.
> Moreover, the changes are local to the Percentile and ApproxPercentile implementations.
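To make the intended result concrete, here is a minimal sketch of an exact percentile over (value, frequency) pairs in plain Scala, with no Spark dependency. The semantics are an assumption, since the issue does not spell them out: the result should match repeating each value frequency times and interpolating linearly between neighbouring ranks. The object and method names are made up for illustration.

```scala
object WeightedPercentile {
  /** Exact percentile of (value, frequency) pairs for a percentage p in [0, 1].
    * Assumed semantics: same as expanding each value `frequency` times and
    * interpolating linearly between the two nearest ranks. */
  def percentile(table: Seq[(Double, Long)], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")
    require(table.forall(_._2 >= 0L), "frequencies must be non-negative")
    val sorted = table.filter(_._2 > 0L).sortBy(_._1)
    require(sorted.nonEmpty, "need at least one row with a positive frequency")

    // cumulative(i) = number of observations with value <= sorted(i)._1
    val cumulative = sorted.scanLeft(0L)(_ + _._2).tail
    val total = cumulative.last

    // Value found at a 0-based rank of the (virtually) expanded data.
    def valueAt(rank: Long): Double =
      sorted(cumulative.indexWhere(_ > rank))._1

    val position = p * (total - 1)
    val lower = math.floor(position).toLong
    val upper = math.ceil(position).toLong
    val fraction = position - lower
    valueAt(lower) + fraction * (valueAt(upper) - valueAt(lower))
  }
}
```

Using only the three rows written out in the issue, (21, 10), (22, 15) and (23, 18), there are 43 observations, and WeightedPercentile.percentile(rows, 0.5) returns 22.0 without ever materialising the expanded data.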