Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/12/21 20:33:38 UTC

[GitHub] leerho edited a comment on issue #6743: IncrementalIndex generally overestimates theta sketch size

leerho edited a comment on issue #6743: IncrementalIndex generally overestimates theta sketch size
URL: https://github.com/apache/incubator-druid/issues/6743#issuecomment-449490568

@gianm

> The granddaddy of Druid sketches, hyperUnique, doesn't have a very high max memory footprint, so it was less of an issue back then.

I think you are referring to the [HyperLogLog (HLL) sketch implementation](https://github.com/apache/incubator-druid/blob/master/hll/src/main/java/org/apache/druid/hll/HyperLogLogCollector.java) that was developed for Druid some years ago. There are several things you need to understand about that implementation.
- The implementation is flawed and has been from the beginning. It can produce errors as much as 7 times larger than a correct implementation would. These errors show up during merge operations and are documented in [characterization studies](https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html) we performed this past year. I strongly advise that this Druid HLL sketch be deprecated and removed from Druid. An unsuspecting user gets no warning whatsoever when this sketch is producing incorrect results.
- The HLL implementation in the DataSketches library is not only much more accurate, but it is faster as well and is approximately the same size. We can work with you to create a PR that would possibly redirect references to the old Druid HLL sketch to our implementation that fixes these problems.
- HLL sketches are small in size because of the nature of the underlying HLL algorithm. If users were to choose the DataSketches HLL implementation, it would also be small in size. However, the HLL algorithm is not designed to allow Intersections. If the use-case for counting uniques only requires merging and does not require Intersection operations, then the HLL algorithm is a reasonable choice (although there is now an algorithm that is superior to HLL in terms of accuracy per space, and we can provide that as an option to your users as well). The Theta Sketch from the DS library is a larger sketch (and the one that I was modeling earlier in this thread), but it is designed from the outset to enable set Intersection and set difference operations. Because set expressions are so powerful from an analysis point-of-view, many users choose the Theta Sketch in spite of its larger size. This has also been our experience at Yahoo.
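
To illustrate why retaining hash values makes Intersection possible, here is a toy sketch of the idea. It is not the DataSketches Theta Sketch implementation: the class `ToyThetaSample`, the fixed threshold `THETA`, and the SplitMix64-style `hash` function are all assumptions made for illustration. Each toy "sketch" keeps only the hashes that fall below a threshold, so two of them can be intersected directly and the result scaled back up to an estimate.

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration only; names and constants here are hypothetical,
// and this is NOT how the DataSketches library is implemented.
class ToyThetaSample {
    // Keep a hash only if it falls below THETA (1/16 of the 63-bit hash range).
    static final long THETA = Long.MAX_VALUE / 16;
    static final double SAMPLING_FRACTION = THETA / (double) Long.MAX_VALUE;

    /** SplitMix64-style mixer used as a stand-in hash; result is forced non-negative. */
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x = x ^ (x >>> 31);
        return x >>> 1;
    }

    /** Builds a toy sketch of the ids in [fromInclusive, toExclusive). */
    static Set<Long> sketchOfRange(long fromInclusive, long toExclusive) {
        Set<Long> retained = new HashSet<>();
        for (long id = fromInclusive; id < toExclusive; id++) {
            long h = hash(id);
            if (h < THETA) {
                retained.add(h);
            }
        }
        return retained;
    }

    /** Intersection works because both sketches retain comparable hash values. */
    static Set<Long> intersect(Set<Long> a, Set<Long> b) {
        Set<Long> out = new HashSet<>(a);
        out.retainAll(b);
        return out;
    }

    /** Scales the retained count back up by the sampling fraction. */
    static double estimate(Set<Long> retained) {
        return retained.size() / SAMPLING_FRACTION;
    }

    public static void main(String[] args) {
        Set<Long> a = sketchOfRange(0, 100_000);
        Set<Long> b = sketchOfRange(50_000, 150_000);
        // The true overlap is 50,000 ids; the printed value is only an estimate.
        System.out.println("estimated intersection: " + estimate(intersect(a, b)));
    }
}
```

An HLL-style structure keeps only per-register maxima rather than the hashes themselves, so there is nothing comparable to intersect; that is the trade-off described in the last bullet above.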
***
I'm not sure I follow your proposal and may be missing something. Nonetheless, one of the challenges of just carving out a chunk of the processing buffer to be managed by the individual BufferAggregators is that truly gaining back the "unused memory" would require memory management sophistication similar to the design of malloc(), free() and a garbage collector, which is non-trivial. Imagine each sketch with an internal hash-table. As individual sketches grow, they need to obtain a larger chunk of memory for a bigger hash-table, move their data into the larger chunk and free the previous chunk. If those freed chunks don't get reused, then we will not achieve the optimum memory savings. Allowing the sketches to allocate and free memory directly from the operating system allows us to leverage the already existing and very sophisticated malloc() and free() of the underlying C-libraries and the OS itself.
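
As a rough illustration of the reuse problem, the following minimal simulation assumes a naive arena that only bumps an offset and never reuses freed regions. The class `GrowthWaste`, the nested `BumpArena`, and the doubling growth policy are hypothetical and are not taken from Druid or DataSketches; they only model a single hash-table growing inside a fixed buffer.

```java
// Hypothetical simulation; not Druid or DataSketches code.
class GrowthWaste {
    /** Naive arena: hands out regions by bumping an offset; freed regions are never reused. */
    static final class BumpArena {
        long highWaterMark = 0; // bytes consumed from the arena so far
        long liveBytes = 0;     // bytes currently in use by live tables

        long allocate(long size) {
            long offset = highWaterMark;
            highWaterMark += size;
            liveBytes += size;
            return offset;
        }

        void free(long size) {
            liveBytes -= size; // the region is released, but the arena cannot hand it out again
        }
    }

    /** Grows one table from startBytes up to finalBytes by repeated doubling. */
    static void growSketch(BumpArena arena, long startBytes, long finalBytes) {
        long size = startBytes;
        arena.allocate(size);
        while (size < finalBytes) {
            long bigger = size * 2;
            arena.allocate(bigger); // obtain the larger chunk
            // (the table contents would be moved into the larger chunk here)
            arena.free(size);       // release the previous chunk
            size = bigger;
        }
    }

    public static void main(String[] args) {
        BumpArena arena = new BumpArena();
        growSketch(arena, 1024, 16384);
        System.out.println("consumed=" + arena.highWaterMark + " live=" + arena.liveBytes);
    }
}
```

Under these assumptions, a table that grows from 1 KiB to 16 KiB consumes 31,744 bytes of the arena while only 16,384 bytes are live. Recovering the difference is exactly the free-list and coalescing work that malloc() and free() already do.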

BTW, there is already a mechanism in the JVM to track usage of off-heap memory used by DirectByteBuffers. We use this same mechanism to track allocations we make against off-heap memory even though we don't use DirectByteBuffers. So the JVM knows and tracks this usage already. Druid could also monitor these same JVM counters if it wanted or needed to.
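
One way to read such counters from inside a JVM is the standard `BufferPoolMXBean` for the pool named `direct`. The sketch below assumes this MXBean reflects the same accounting referred to above, which I have not verified for the DataSketches Memory package; the class `DirectMemoryProbe` and its method are hypothetical names used for illustration.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper; only the java.lang.management and java.nio calls are standard JDK API.
class DirectMemoryProbe {
    /** Returns the bytes accounted to the JVM's "direct" buffer pool, or -1 if the pool is not found. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB off-heap
        long after = directMemoryUsed();
        System.out.println("direct pool before=" + before + " after=" + after
                + " capacity=" + buf.capacity());
    }
}
```

A monitor in Druid could poll the same bean periodically instead of allocating anything itself.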
