Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/07/12 19:15:01 UTC

[GitHub] [incubator-druid] himanshug opened a new issue #8071: add aggregators for computing mean/average

URL: https://github.com/apache/incubator-druid/issues/8071
 
 
   ### Motivation
   
   I have a use case that requires querying the mean of certain columns, and I need an easy and efficient way to do so.
   
   ### Proposed changes
   
   We would introduce a `DoubleMeanAggregatorFactory` implementation and related classes, e.g. `DoubleMeanAggregator`. It would use the following well-known incremental algorithm.
   
   ```
   // maintain following variables
   long count;
   double mean;
   
   // update with a value v
   count++;
   mean = mean + (v - mean)/count;
   
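   // merging two partial states (count1, mean1) and (count2, mean2);
   // guard the division so that merging two empty states does not yield 0/0
   count = count1 + count2;
   mean = (count == 0) ? 0 : (mean1*count1 + mean2*count2)/count;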
   ```
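
For illustration, the update and merge steps could be sketched in Java roughly as follows. The class name `RunningMean` and its methods are hypothetical and are not the proposed Druid classes; returning `NaN` for an empty state is an assumption, since the proposal does not say how an empty input should be reported.

```java
// Hypothetical illustration only; not the actual DoubleMeanAggregator.
final class RunningMean {
    private long count;
    private double mean;

    /** Folds one value into the running mean. */
    void add(double v) {
        count++;
        mean += (v - mean) / count;
    }

    /** Merges another partial state into this one, weighting each mean by its count. */
    void merge(RunningMean other) {
        long total = count + other.count;
        if (total == 0) {
            return; // both sides are empty; avoid dividing 0 by 0
        }
        mean = (mean * count + other.mean * other.count) / total;
        count = total;
    }

    long count() {
        return count;
    }

    /** Returns NaN when no values were seen (an assumption, see above). */
    double mean() {
        return count == 0 ? Double.NaN : mean;
    }
}
```

Keeping `count` next to `mean` is what makes partial results mergeable: a bare mean from each segment could not be combined correctly without knowing how many values it summarizes.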
   
   Consequently, a new aggregator type called `doubleMean` would be made available.
   
   ### Rationale
   Compared to the alternatives, the proposed implementation is the most straightforward and lowest-overhead way to get the mean of a column.
   
   Alternative #1:
   Use the `doubleSum` and `count` aggregators, then use an `arithmetic` or `expression` post-aggregator to do the division that computes the mean. This becomes tedious for systems that generate Druid queries, and mean is a common enough aggregation that it should be available out of the box.
   
   Alternative #2:
   Add a `MeanPostAggregator` that extracts the mean from the `VarianceAggregatorCollector` maintained by `VarianceAggregator`, or add an option to `VarianceAggregatorFactory` to output the mean in its `finalizeComputation(obj)` method. However, this would unnecessarily maintain the extra variables needed only for variance.
   
   ### Operational impact
   None
   
   ### Test plan (optional)
   The changes proposed here are easily unit-testable.
   
   ### Future work (optional)
   
   Maybe a `floatMean` aggregator, if some use case is particularly concerned about saving a few bytes at query time. However, I don't think mean is a statistic that should be indexed and stored in segments, so `floatMean` is not important.
   
