Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/11/18 05:20:12 UTC

[GitHub] leerho commented on issue #6638: Fixed buckets histogram aggregator

URL: https://github.com/apache/incubator-druid/pull/6638#issuecomment-439668705
 
 
   @leventov
   
   I haven't been following this thread closely, nor have I examined the code being put into Druid in much detail, but at a high level, and from a practical and historical perspective, streaming histogram algorithms fall into basically three types: fixed-bin, heuristic/empirical, and quantile-sketch algorithms.
   
   I should point out that, for comparable data (like numerics), histogram algorithms and quantile algorithms are closely related: if you know the quantiles (equivalently, the CDF) of a distribution, you can create a PMF, CDF, or histogram from them.
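
   For instance, this relationship can be shown in a few lines (a conceptual sketch only; the `cdf` function below stands in for whatever rank/quantile estimator is available and is not any particular library's API):

   ```java
   import java.util.function.DoubleUnaryOperator;

   public class CdfToHistogram {
     /**
      * Given an (approximate) CDF, i.e. the normalized rank function of the stream,
      * the mass in each histogram bin is just the difference of the CDF at
      * consecutive split points. This is why a good rank/quantile estimator
      * immediately yields a PMF, CDF, or histogram.
      */
     static double[] binMasses(DoubleUnaryOperator cdf, double[] splitPoints) {
       double[] masses = new double[splitPoints.length + 1];
       double prev = 0.0;
       for (int i = 0; i < splitPoints.length; i++) {
         double cur = cdf.applyAsDouble(splitPoints[i]); // fraction of items <= splitPoints[i]
         masses[i] = cur - prev;
         prev = cur;
       }
       masses[splitPoints.length] = 1.0 - prev; // tail bin above the last split point
       return masses;
     }

     public static void main(String[] args) {
       // Exactly known CDF (uniform on [0, 100]) just to show the mechanics.
       DoubleUnaryOperator uniformCdf = x -> Math.max(0.0, Math.min(1.0, x / 100.0));
       double[] masses = binMasses(uniformCdf, new double[] {25.0, 50.0, 75.0});
       for (double m : masses) System.out.printf("%.2f%n", m); // 0.25 0.25 0.25 0.25
     }
   }
   ```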
   
   - Fixed-bin algorithms: This is the simplest and most primitive approach, but it requires the user to know a great deal, up front, about the data they want to collect. The user must make assumptions about the min and max values and about how the data is likely to be distributed, then decide how many bins to define between the min and the max and assign specific split-point values for the boundaries of all the bins. This can be very tricky, because the resulting accuracy depends greatly on how the actual data is distributed: linear, Gaussian, log-normal, and power-law distributions require very different bin spacings. If the input data has gaps (missing values), zeros, spikes or jumps, or exceeds the user's assumed min and max values, the resulting histogram can be useless. Each bin is a simple counter that records how many events fell within the range of that bin, and the code is very simple as well: it just performs a search or lookup for each value and increments the appropriate counter (a minimal sketch of this follows after the next paragraph).
   
   This approach is nearly always the fastest, but space can be a problem: in the attempt to achieve high resolution, the user may simply increase the number and density of the bins, which can consume a huge amount of space. 100,000 bins, for example, would consume far more space than the heuristic and sketch algorithms. Because the fixed-bin approach is so dependent on the actual data, it is the least reliable in terms of accuracy and should generally be avoided except in very specific applications where the input data is well understood beforehand.
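
   A minimal sketch of a fixed-bin (equal-width) histogram, to make the above concrete (illustrative only; this is not the aggregator implemented in this PR, and the class and field names are made up):

   ```java
   /**
    * Minimal fixed-bin histogram: min, max, and bin count must be chosen up front,
    * so every assumption about the data's range and shape is baked in.
    */
   public class FixedBinHistogram {
     private final double min, max, binWidth;
     private final long[] counts;   // one simple counter per bin
     private long below, above;     // values that violate the assumed [min, max) range

     public FixedBinHistogram(double min, double max, int numBins) {
       this.min = min;
       this.max = max;
       this.binWidth = (max - min) / numBins;
       this.counts = new long[numBins];
     }

     public void add(double value) {
       if (value < min) { below++; return; }    // data below the assumed min
       if (value >= max) { above++; return; }   // data at or above the assumed max
       // A simple lookup and increment; the clamp guards against floating-point edge cases.
       int bin = Math.min(counts.length - 1, (int) ((value - min) / binWidth));
       counts[bin]++;
     }
   }
   ```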
   
   - Heuristic/empirical algorithms: There are many types in this category, but I think the main ones are centroid-based, moment-based, and dynamic-bin-based. I'm sure there are others, but they all have various dependencies on the input data stream, though not quite as severe as the fixed-bin algorithms.
   
   They can also be divided into two groups based on whether the input data covers a narrow or a wide dynamic range: if the input data ranges over many orders of magnitude, then log-based approaches should be used, where the log of each item is computed prior to further processing. This is only useful for data that is always greater than zero (although zero itself can be handled with a simple substitution). Excellent candidates include response-time or latency data, file sizes, city populations, pebble sizes, geographic or astronomical distances, etc. For this kind of data the above algorithms become log-centroid-based, log-moment-based, and dynamic log-bin-based.
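
   A small illustration of that log pre-processing (a sketch under the stated assumption of non-negative data; the zero-substitution constant is arbitrary and would have to be chosen per application):

   ```java
   /**
    * Log pre-processing for wide-dynamic-range, positive-valued data:
    * feed ln(x) to the downstream histogram/quantile algorithm and
    * exponentiate when mapping bin boundaries back to the value domain.
    */
   public class LogTransform {
     // Arbitrary stand-in for zero; must lie below the smallest "real" value of interest.
     private static final double ZERO_SUBSTITUTE = 1e-9;

     public static double toLogDomain(double value) {
       if (value < 0.0) {
         throw new IllegalArgumentException("log-based approaches require non-negative data");
       }
       return Math.log(value == 0.0 ? ZERO_SUBSTITUTE : value);
     }

     public static double toValueDomain(double logValue) {
       return Math.exp(logValue);   // e.g. when converting bin edges back for display
     }
   }
   ```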
   
   I don't want to go into depth about this class of algorithms except to point out that none of them can make rank-accuracy guarantees without making some assumptions about the input data stream. (Value-accuracy guarantees are nearly impossible to make without some assumption about the input data.) The best they can do is something like: "if the input data is relatively well behaved and smooth in its distribution, then the accuracy can be pretty good."
   
   A [recently published paper](https://arxiv.org/abs/1803.01969) on moment-based quantiles was able to establish upper bounds on its rank accuracy, but only for the "L1" measure of error, which, unfortunately, is not a practical measure of error for most users. (The "L1" error is the area between the true distribution and the distribution mathematically reconstructed from the k moments computed from the input stream. If the two distributions fit each other like a glove over the full range, everything is wonderful. But there is no guarantee that the actual fit is not like a square peg in a round hole. So depending on where along the distribution the user makes a query, the accuracy could be great or horrible, and the user has no idea or warning that this is happening.)
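
   To make the distinction concrete (a rough formalization in my own notation, not taken from the paper): let $F$ be the true CDF of the stream and $\hat{F}$ the estimate reconstructed from the summary. Then

   $$\mathrm{err}_{L_1} = \int \bigl|F(x) - \hat{F}(x)\bigr|\,dx, \qquad \mathrm{err}_{L_\infty} = \sup_x \bigl|F(x) - \hat{F}(x)\bigr|.$$

   A small L1 error only bounds the discrepancy on average over the whole range; at any particular query point the gap can still be large, which is exactly the square-peg-in-a-round-hole problem above. A small L-infinity error bounds the gap at every point, which is the kind of guarantee discussed next.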
   
   These heuristic/empirical algorithms can be quite small in size and perform quite well in terms of merge speed, but their dependence on the input data is their biggest weakness.
   
   - Sketch algorithms: True sketch algorithms (like the quantiles algorithm in the DataSketches library) are the only algorithms that can claim independence of the input data. In other words, the user does not need to know, beforehand, the min or max of the stream. The input stream can contain negative values, zeros, gaps, spikes, humps, and bumps in its distribution, and the values can range over multiple orders of magnitude. Of the algorithms benchmarked, the DS-quantiles sketch is the only one that can claim (mathematically) _data independence_: the rank-accuracy guarantees of the sketch hold, with a specified statistical confidence, regardless of the input. The DS-quantiles sketch defines its error in terms of "L-infinity" error, which applies to each and every quantile (or rank) query against the sketch, anywhere along the distribution, which is what most users would like to be the case.
   
   Well-designed sketch algorithms can be small in size and fast (though not as small and fast as a well-designed moments-based quantiles algorithm, for example).
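
   For reference, using a quantiles sketch looks roughly like the following. This is a usage sketch only, not code from this PR; the package and method names are taken from the Apache DataSketches library as I recall them and may differ by version, so check the library docs:

   ```java
   import org.apache.datasketches.quantiles.DoublesSketch;
   import org.apache.datasketches.quantiles.UpdateDoublesSketch;

   public class QuantilesSketchExample {
     public static void main(String[] args) {
       // k controls the size/accuracy trade-off; no min, max, or bin layout is needed up front.
       UpdateDoublesSketch sketch = DoublesSketch.builder().setK(128).build();

       // Stream values in: zeros, spikes, gaps, and many orders of magnitude are all fine.
       for (double v : new double[] {0.0, 1e-6, 3.0, 3.0, 42.0, 1e9}) {
         sketch.update(v);
       }

       // Quantiles, plus a histogram derived after the fact over any split points you like.
       double median = sketch.getQuantile(0.5);
       double[] pmf = sketch.getPMF(new double[] {1.0, 10.0, 1000.0});
       System.out.println("median ~= " + median);
       // pmf holds the approximate fractions of values below 1, in [1, 10), [10, 1000), and >= 1000.
       System.out.println(java.util.Arrays.toString(pmf));
     }
   }
   ```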
   
   As for this particular PR, benchmarking the accuracy, size, and speed of these various sketches can be fraught with problems:
   
   - The tester must make some assumption about the input distributions streamed into each of the algorithms. The attached benchmarks in this PR mention a random uniform distribution and a Gaussian distribution, neither of which is common in real data. They are also very smooth and well behaved. This kind of input data can perform well for all the algorithms, including the fixed-bin and heuristic/empirical algorithms, with proper setup. So it is hardly a test at all, as these input distributions will not provide meaningful separation between the test subjects (see the hypothetical generator sketched after this list). It is rather like using a simple arithmetic problem to determine whether a student is ready for an advanced calculus class.
   
   - Performing benchmark accuracy tests can be tricky, and it is critical that the tester be very clear about how the algorithm error is being measured. Was it relative or absolute error? Was the error based on value or on rank? How many trials per measurement were used? (It appears that the number of trials was 25; it should be about 2500.) How was the error computed from the trials: RMSE, MSE, SE, RSE, average, average absolute error? It is not clear. If multiple trials are performed, what about the variance? It is the error variance that says more about the accuracy properties of an algorithm than the average alone. Was the error distribution biased? All of these questions should be documented; the user should not have to dig through the benchmark test code to figure them out.
   
   - What error guarantees, if any, is the algorithm under test able to provide given the chosen configuration? Did the algorithm meet its guarantee? Note that if one algorithm specifies an L1 error guarantee and another specifies an L-infinity guarantee, the tester needs to understand how to discriminate between those guarantees in his/her tests, because otherwise it is like comparing apples and oranges.
   
   - When publishing benchmarks like this, it is the tester's responsibility to use input data distributions that actually discriminate between the algorithms, and to document what those distributions are. Then the user can decide whether those distributions apply to his/her application. I don't feel that was done in this particular case, so I find these benchmarks uninformative and potentially misleading to a user.
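
   As one concrete, hypothetical example of a more discriminating input: a mixture that spans several orders of magnitude and includes exact zeros and a spike will separate the data-dependent algorithms from the sketches far better than smooth uniform or Gaussian data. The parameters below are made up purely for illustration:

   ```java
   import java.util.Random;

   /**
    * Hypothetical benchmark stream: a log-normal body spanning roughly 0.01 to 30,000,
    * mixed with exact zeros and a heavy spike at a single value. Smooth uniform or
    * Gaussian inputs hide the weaknesses of fixed-bin and heuristic algorithms;
    * a stream like this exposes them.
    */
   public class SkewedStreamGenerator {
     private final Random rnd;

     public SkewedStreamGenerator(long seed) {
       this.rnd = new Random(seed);
     }

     public double next() {
       double u = rnd.nextDouble();
       if (u < 0.05) return 0.0;                         // 5% exact zeros (missing/empty values)
       if (u < 0.15) return 250.0;                       // 10% spike at one value
       return Math.exp(3.0 + 2.5 * rnd.nextGaussian());  // heavy-tailed log-normal body
     }
   }
   ```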
   
   ***
   I don't believe that Druid benefits from having all of these histogram variants; it can be very confusing to users. Having the moments-based quantiles algorithm and the DataSketches quantiles sketch should be sufficient. Then document those two thoroughly: how they should be used, and when. The fixed-bin and other heuristic/empirical algorithms just add confusion, on top of decades of their misuse and abuse in the analysis of data.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org