Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/03/01 20:45:04 UTC

[GitHub] leerho commented on issue #6099: Inconsistencies in the result of the quantile aggregator

URL: https://github.com/apache/incubator-druid/issues/6099#issuecomment-468805588
 
 
   The code above, the Druid [doc](http://druid.io/docs/latest/development/extensions-core/approximate-histograms.html), the referenced [paper](http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf), and the [blog](https://metamarkets.com/2013/histograms/) all refer to the Druid built-in "Approximate Histogram". It is an empirical, heuristic algorithm that is very sensitive to the data and can produce significant errors. Note that the authors of the paper that describes the algorithm specifically mention several caveats:
   - “The second category, to which our algorithm belongs, consists of heuristics that work well empirically and demand low amounts of space, but lack any rigorous accuracy analysis.” Page 856, section 2.4.1.
   - “…when the data distribution is highly skewed, the accuracy of [our] on-line histogram decays …[so] practitioners may prefer to resort to guaranteed accuracy algorithms.” Page 857, first paragraph.
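
   To make the "heuristic" point concrete, here is a minimal sketch of the bin-merging update step as I understand it from the paper. It is an illustration only and is not Druid's implementation; the class name `StreamingHistogram`, the parameter `max_bins`, and the method names are hypothetical.

```python
import bisect


class StreamingHistogram:
    """Fixed-size histogram in the style of the Ben-Haim and Tom-Tov paper.

    Hypothetical illustration; not the Druid Approximate Histogram code.
    """

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.centroids = []  # bin centers, kept sorted
        self.counts = []     # counts[i] is the weight of centroids[i]

    def update(self, x):
        """Add one value, then shrink back to max_bins if needed."""
        i = bisect.bisect_left(self.centroids, x)
        if i < len(self.centroids) and self.centroids[i] == x:
            self.counts[i] += 1
            return
        self.centroids.insert(i, x)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        """Replace the two nearest adjacent bins with their weighted mean."""
        j = min(
            range(len(self.centroids) - 1),
            key=lambda k: self.centroids[k + 1] - self.centroids[k],
        )
        n = self.counts[j] + self.counts[j + 1]
        merged = (
            self.centroids[j] * self.counts[j]
            + self.centroids[j + 1] * self.counts[j + 1]
        ) / n
        self.centroids[j:j + 2] = [merged]
        self.counts[j:j + 2] = [n]

    def total(self):
        """Number of values seen so far."""
        return sum(self.counts)


if __name__ == "__main__":
    h = StreamingHistogram(max_bins=3)
    for v in [1, 2, 10, 11]:
        h.update(v)
    print(list(zip(h.centroids, h.counts)))
```

   The merge rule only looks at which two centroids are currently closest. Nothing in it bounds the rank error of a quantile read back from the bins, which fits the authors' own caveat about skewed data.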
   
   **Do not confuse** this Druid built-in Approximate Histogram with the `DataSketches Quantiles DoublesSketch`, which can also produce approximate histograms. The error and size [table](https://datasketches.github.io/docs/Quantiles/QuantilesAccuracy.html) referenced above is for the `DataSketches Quantiles DoublesSketch` and **NOT** for the Druid built-in Approximate Histogram.
   
   Visit [Druid Approximate Histogram Study](https://datasketches.github.io/docs/Quantiles/DruidApproxHistogramStudy.html) to get an idea of how poorly it can perform on real data.  
   
   Visit [Quantiles StreamA Study](https://datasketches.github.io/docs/Quantiles/QuantilesStreamAStudy.html) to compare how the `DataSketches Quantiles DoublesSketch` performs on the same data.
   
   The Druid Approximate Histogram has no guarantees on error bounds. For this reason, and several others, it does not qualify as a true "sketch". See discussion #6869, which proposes deprecating the Druid built-in Approximate Histogram and replacing it with the newer, much more robust and accurate `DataSketches Quantiles DoublesSketch`.
