Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2016/04/04 16:42:25 UTC

[jira] [Comment Edited] (FLINK-3613) Add standard deviation, mean, variance to list of Aggregations

    [ https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224259#comment-15224259 ] 

Stephan Ewen edited comment on FLINK-3613 at 4/4/16 2:42 PM:
-------------------------------------------------------------

The design of the extended aggregators makes a lot of sense. I agree with Fabian, however, that we should discuss a few things first:

  1. Do we want such extended aggregations in the DataSet API, or do we basically push people to use the Table API instead? My gut feeling is that it makes sense to have this in the DataSet API if we answer (2) with "yes" and have a good design for (3).
  2. I assume it should allow combining multiple aggregation functions, so that one could create something like {{(a, b) --> (max(a), min(a), avg(b))}}
  3. How do we want the signatures for this to look? Ideally, this would be typesafe via a builder (similar to the CSV input on ExecutionEnvironment).
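
To make points (2) and (3) concrete, here is a minimal sketch of what a builder for {{(a, b) --> (max(a), min(a), avg(b))}} could look like. It is plain Java and does not use any Flink API: the class {{MultiAgg}} and its methods are hypothetical names invented for illustration, fields are read through typed extractor lambdas, and the result is simplified to a {{double[]}} rather than a typed tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch, not part of Flink: several aggregates computed in one pass. */
final class MultiAgg<T> {
    private enum Kind { MAX, MIN, AVG }

    private final List<Kind> kinds = new ArrayList<>();
    private final List<ToDoubleFunction<T>> fields = new ArrayList<>();

    /** The class argument is only used to fix the row type T for the lambdas. */
    static <T> MultiAgg<T> of(Class<T> rowType) {
        return new MultiAgg<>();
    }

    MultiAgg<T> max(ToDoubleFunction<T> field) { return add(Kind.MAX, field); }
    MultiAgg<T> min(ToDoubleFunction<T> field) { return add(Kind.MIN, field); }
    MultiAgg<T> avg(ToDoubleFunction<T> field) { return add(Kind.AVG, field); }

    private MultiAgg<T> add(Kind kind, ToDoubleFunction<T> field) {
        kinds.add(kind);
        fields.add(field);
        return this;
    }

    /** Returns one result per registered aggregate, in registration order. */
    double[] apply(Iterable<T> rows) {
        int n = kinds.size();
        double[] acc = new double[n];
        for (int i = 0; i < n; i++) {
            switch (kinds.get(i)) {
                case MAX: acc[i] = Double.NEGATIVE_INFINITY; break;
                case MIN: acc[i] = Double.POSITIVE_INFINITY; break;
                default:  acc[i] = 0.0; // AVG starts as a running sum
            }
        }
        long count = 0;
        for (T row : rows) {
            count++;
            for (int i = 0; i < n; i++) {
                double v = fields.get(i).applyAsDouble(row);
                switch (kinds.get(i)) {
                    case MAX: acc[i] = Math.max(acc[i], v); break;
                    case MIN: acc[i] = Math.min(acc[i], v); break;
                    default:  acc[i] += v;
                }
            }
        }
        for (int i = 0; i < n; i++) {
            if (kinds.get(i) == Kind.AVG) {
                acc[i] = count == 0 ? Double.NaN : acc[i] / count;
            }
        }
        return acc;
    }
}
```

With rows modelled as {{double[]}} pairs (a, b), the example from (2) would read {{MultiAgg.of(double[].class).max(r -> r[0]).min(r -> r[0]).avg(r -> r[1]).apply(rows)}}. A real design would presumably also track the result types, which this sketch leaves out.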





> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, and variance for org.apache.flink.api.java.aggregation.Aggregations.
> Ideally, the implementation should be single-pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et al., International Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces the numerical errors that occur when adding a sequence of finite precision floating point numbers. Numerical errors arise due to truncation and rounding. These errors can lead to numerical instability when calculating variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
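
For reference, the "single-pass and numerically stable" requirement in the issue description can be sketched roughly as follows. This is not the attached design and not Flink code: {{RunningStats}} is a hypothetical class name, the variance uses Welford's online update (a technique chosen here for illustration), {{merge}} uses the pairwise combination formula of Chan et al. so that partial results from parallel partitions could be combined, and {{kahanSum}} shows the compensated summation mentioned in the second reference. The variance returned is the population variance.

```java
/** Hypothetical helper, not part of Flink: single-pass mean and variance. */
final class RunningStats {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Folds one value into the running statistics (Welford's update). */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Combines two partial results without revisiting the data. */
    RunningStats merge(RunningStats other) {
        RunningStats out = new RunningStats();
        out.count = count + other.count;
        if (out.count == 0) {
            return out;
        }
        double delta = other.mean - mean;
        out.mean = mean + delta * other.count / out.count;
        out.m2 = m2 + other.m2 + delta * delta * count * other.count / out.count;
        return out;
    }

    long count() { return count; }

    double mean() { return count == 0 ? Double.NaN : mean; }

    /** Population variance; divide m2 by (count - 1) instead for the sample variance. */
    double variance() { return count == 0 ? Double.NaN : m2 / count; }

    double stdDev() { return Math.sqrt(variance()); }

    /** Kahan (compensated) summation: carries the rounding error of each addition forward. */
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0;
        for (double v : values) {
            double y = v - compensation;
            double t = sum + y;
            compensation = (t - sum) - y;
            sum = t;
        }
        return sum;
    }
}
```

Keeping only {{count}}, {{mean}} and {{m2}} as state means the aggregate fits a combinable reduce: each partition calls {{add}} per record, and the partial results are joined with {{merge}}.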


