Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/11/19 20:09:31 UTC

[GitHub] jon-wei commented on issue #6638: Fixed buckets histogram aggregator

jon-wei commented on issue #6638: Fixed buckets histogram aggregator
URL: https://github.com/apache/incubator-druid/pull/6638#issuecomment-440025681
 
 
   I've removed the comparison benchmarks between the fixed bins and approx/sketch aggs, since they were flawed/insufficient. 
   
   I've kept the benchmarks for just the fixed bins to give an idea of its performance (without regard to accuracy), since that performance should be fairly consistent across distributions and depends mainly on the number of buckets used.
   
   (PS: The add performance could be roughly twice as fast without locking, which is needed for realtime indexing. A way to let aggregators know when they are running in a context that requires synchronization would be useful in general, as a subject for another PR.)
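
To make the PS concrete, here is a rough sketch of what "knowing whether synchronization is required" could look like. This is not code from the PR and not an existing Druid API: the class `OptionallyLockedCounter` and its `needsSynchronization` flag are hypothetical names used only for illustration, and the sketch says nothing about the actual speedup.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only: a bucket counter that takes a lock on each
// update only when the caller says concurrent access is possible.
final class OptionallyLockedCounter {
    private final long[] counts;
    // null means the caller guarantees single-threaded access.
    private final ReentrantReadWriteLock lock;

    OptionallyLockedCounter(int numBuckets, boolean needsSynchronization) {
        this.counts = new long[numBuckets];
        this.lock = needsSynchronization ? new ReentrantReadWriteLock() : null;
    }

    // Increment one bucket, locking only if synchronization was requested.
    void increment(int bucket) {
        if (lock == null) {
            counts[bucket]++;
            return;
        }
        lock.writeLock().lock();
        try {
            counts[bucket]++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Return a copy of the counts, consistent with respect to concurrent writers.
    long[] snapshot() {
        if (lock == null) {
            return counts.clone();
        }
        lock.readLock().lock();
        try {
            return counts.clone();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Both modes produce the same counts; the unlocked mode simply skips the lock acquisition on every add, which is where the saving mentioned above would come from.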
   
   ------------------
   @leerho Thanks for the detailed commentary and feedback!
   
   @b-slim 
   
   > Also any reason this is not a separate extension i do not see why it is merged together.
   
   Mainly, the aggregator works with the min/max/quantile/quantiles post-aggs that are provided by the druid-histogram extension, so it's simpler to have them be in the same extension. The name "druid-histogram" also doesn't preclude a fixed bins implementation, so I felt it was reasonable to keep them in the same extension.
   
   @b-slim 
   
   > Also it is hard to make sense of the benchmarks without touching on the accuracy of the estimate, in the end of the day one would expect to see what it the tradeoff between how much accurate is the estimate vs compute/aggregate/merge time.
   
   Re: accuracy and when to use the fixed bins, I think @leerho gave an excellent breakdown:
   
   > Fixed-bin algorithms: This is the simplest, and most primitive approach, but requires the user to know, up-front, a great deal about the data that they want to collect. The user must make assumptions about the min and max values, and about how the data is likely distributed. The user then decides how many bins to define between the min and the max and assigns specific split-point values for the boundaries of all the bins. This can be very tricky because the resulting accuracy will depend greatly on how the actual data is distributed. Linear, Gaussian, log-normal, and power-law distributions require very different bin spacings. If the input data has gaps (missing values), zeros, spikes or jumps, or exceeds the user's assumed min and max values, the resulting histogram could be useless. Each bin is a simple counter that records how many events fell within the range of that bin. The code is very simple as well, it just performs a simple search or lookup for each value and increments the appropriate counter.
   
   > This approach is nearly always the fastest but space could be a problem. In the attempt to achieve high resolution of the result, the user may just increase the number and density of the bins. This can lead to a huge amount of space consumed. 100,000 bins would consume way more space than the heuristic and sketch algorithms. Because the fixed bin approach is so dependent on the actual data it is the least reliable in terms of accuracy and generally should be avoided except in very specific applications where the input data is well understood beforehand.
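
The counting scheme described in the quote above can be sketched in a few lines. This is a minimal illustration, not the implementation in this PR: it assumes equal-width bins, so the "search or lookup" becomes a single division, and the class name `FixedBinCounter`, its methods, and the choice to count out-of-range values separately are all assumptions made for the example.

```java
// Hypothetical illustration only: equal-width fixed bins over [lowerLimit, upperLimit).
final class FixedBinCounter {
    private final double lowerLimit;
    private final double upperLimit;
    private final double binWidth;
    private final long[] counts;
    private long lowerOutliers; // values below lowerLimit
    private long upperOutliers; // values at or above upperLimit

    FixedBinCounter(double lowerLimit, double upperLimit, int numBins) {
        if (numBins <= 0 || !(upperLimit > lowerLimit)) {
            throw new IllegalArgumentException("need numBins > 0 and upperLimit > lowerLimit");
        }
        this.lowerLimit = lowerLimit;
        this.upperLimit = upperLimit;
        this.binWidth = (upperLimit - lowerLimit) / numBins;
        this.counts = new long[numBins];
    }

    // Find the bin for a value and increment its counter.
    void add(double value) {
        if (Double.isNaN(value)) {
            return; // assumption: NaN is ignored
        }
        if (value < lowerLimit) {
            lowerOutliers++;
            return;
        }
        if (value >= upperLimit) {
            upperOutliers++;
            return;
        }
        int index = (int) ((value - lowerLimit) / binWidth);
        // Guard against floating-point rounding just below upperLimit.
        counts[Math.min(index, counts.length - 1)]++;
    }

    // Merging two histograms with the same layout is element-wise addition.
    void merge(FixedBinCounter other) {
        if (other.counts.length != counts.length
            || other.lowerLimit != lowerLimit
            || other.upperLimit != upperLimit) {
            throw new IllegalArgumentException("bin layouts differ");
        }
        for (int i = 0; i < counts.length; i++) {
            counts[i] += other.counts[i];
        }
        lowerOutliers += other.lowerOutliers;
        upperOutliers += other.upperOutliers;
    }

    long[] counts() { return counts.clone(); }
    long lowerOutliers() { return lowerOutliers; }
    long upperOutliers() { return upperOutliers; }
}
```

With limits 0 and 10 and 5 bins, adding 1.0, 3.0, 3.5 and 9.99 gives bin counts of 1, 2, 0, 0, 1. Values such as -1.0 or 12.0 land in the outlier counters, which is exactly the case the quote warns about when the assumed min and max are wrong. Memory grows linearly with the number of bins, since each bin is one `long`.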
   
   Fixed bins would be a "special case"; for general use, the sketch aggregators are going to be a better choice.
   
   In addition, a major motivation for this PR is to support users who already have fixed bin histograms from elsewhere and want to import those into Druid.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
