You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/07/30 04:44:51 UTC

[GitHub] [incubator-druid] leerho commented on issue #8194: HllSketchMergeBufferAggregator: Speed up init by copying prebuilt sketch.

leerho commented on issue #8194: HllSketchMergeBufferAggregator: Speed up init by copying prebuilt sketch.
URL: https://github.com/apache/incubator-druid/pull/8194#issuecomment-516259077

@gianm
Thank you for this analysis! The current HllSketch starts at a very small size and grows in size by factors of 2 until it is just under the size of the dense sketch and then converts to dense mode, after which it stays fixed. See the bottom chart [here](https://datasketches.github.io/docs/HLL/HLL.html).

I have thought about an option that would eliminate the sparse mode. An empty sketch would be created at full size. Here is where we might be able to take advantage of a high-speed copy of pre-initialized memory upon sketch creation. For the life of the sketch it would never need to allocate or clear any memory. (If you are using HLL-4 mode, there is a little lookup cache that occasionally may need to resize, but that should be rare.)

Druid (so far) always allocates space for the maximum size sketch and cannot take advantage of the fact that the size of the sketch for small cardinalities can be quite small and then grow. If this is going to remain true, then it might make sense to add this option. Druid is wasting a lot of memory because of the fixed allocation sizes and this has been discussed at length in other threads, which I'm not going to repeat here. Nonetheless, I am likely out-of-date with what the latest thinking within the Druid team is on this topic, so please update me :)

Clearly our focus has been to optimize the best possible accuracy for a given size. But if you are willing to allow the sketch to start at maximum size this may make sense for you. It could mean a substantial speed up. When implemented, it will be a configuration option, so you could always play with it and change your mind.

The downside of eliminating the sparse mode is that the accuracy of the sketch for low cardinalities will be worse. It will be within the error guarantees for a given _k_ and behave like the conventional Flajolet-Martin HLL sketch at low cardinalities. Our HLL sparse mode takes advantage of new estimators that were developed for our CPC sketch which outperforms the accuracy / byte of almost any other HLL sketch out there by a large margin.

We are in the middle of this Apache migration, so implementing this may be some months out, unless we could get some help speeding up the migration :) :)

Let me know if this makes sense to you.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org