Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/01/15 23:02:39 UTC

[GitHub] drcrallen opened a new pull request #6865: Densify swapped hll buffer

URL: https://github.com/apache/incubator-druid/pull/6865
 
 
   We had an upstream data producer that was sampling data. The sampling algorithm appeared to be based on Murmur3_128, or at least on a related algorithm with similar hash collisions. When building an HLL sketch of the dimension values, we got degenerate results: the HLL buckets ended up with values that were not good sketches of the input data (for example, every bucket nibble set to `1`). `testCanFillUpOnMod` demonstrates such a scenario.
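
To make the interaction concrete, here is a toy sketch of how a sampler that shares its hash with the sketch can pin every register to `1`. It is not Druid's `HyperLogLogCollector` and does not use Murmur3_128: the `mix` function (a SplitMix64-style finalizer), the 2048-bucket layout, the "keep keys whose hash has the top bit set" sampling rule, and all class and method names are assumptions made for illustration.

```java
// Toy model only: not Druid's HyperLogLogCollector and not Murmur3_128.
final class SampledHllDemo {
    static final int NUM_BUCKETS = 2048; // assumed register count for the toy

    // SplitMix64-style finalizer, used here as a stand-in 64-bit hash.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Hypothetical upstream sampler: keep roughly half the keys,
    // selected by the top bit of the hash.
    static boolean sampled(long key) {
        return mix(key) < 0;
    }

    // Build toy registers: the low 11 bits pick the bucket and
    // (leading zeros + 1) is the rank stored in it.
    static byte[] sketch(long numKeys) {
        byte[] registers = new byte[NUM_BUCKETS];
        for (long key = 0; key < numKeys; key++) {
            if (!sampled(key)) {
                continue;
            }
            long h = mix(key); // same hash as the sampler: this is the problem
            int bucket = (int) (h & (NUM_BUCKETS - 1));
            byte rank = (byte) (Long.numberOfLeadingZeros(h) + 1);
            if (rank > registers[bucket]) {
                registers[bucket] = rank;
            }
        }
        return registers;
    }

    static int maxRegister(byte[] registers) {
        int max = 0;
        for (byte r : registers) {
            max = Math.max(max, r);
        }
        return max;
    }

    public static void main(String[] args) {
        byte[] registers = sketch(200_000);
        // Every sampled hash has its top bit set, so no register can exceed 1.
        System.out.println("max register = " + maxRegister(registers));
    }
}
```

In this toy, the registers carry no information about how many keys were seen, which is the sense in which such a sketch is not a good summary of its input.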
   
   An unfortunate side effect is that the folding operation can easily produce corrupt buffers if the buffer being folded in is sparse. `testRegisterSwapWithSparse` fails against master at `folded.toByteBuffer()`, similar to how Jackson serialization of the collector fails on historicals in the failure mode we found.
   
   ```
   java.nio.BufferOverflowException
   	at java.nio.Buffer.nextPutIndex(Buffer.java:527)
   	at java.nio.HeapByteBuffer.putShort(HeapByteBuffer.java:321)
   	at org.apache.druid.hll.HyperLogLogCollector.toByteBuffer(HyperLogLogCollector.java:488)
   ```
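
The trace above is the generic signature of a relative `putShort` on a `ByteBuffer` that has no room left. The following is a minimal sketch of that failure mode only. It assumes a sparse layout of one `short` position plus one `byte` value per non-zero register, and an output buffer sized from a count that no longer matches the registers; both are illustrative assumptions, and the class and method names are hypothetical rather than taken from `HyperLogLogCollector`.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustration only: the layout and sizing rule are assumptions, not Druid's code.
final class SparseOverflowDemo {
    // Assumed sparse entry: 2-byte position followed by 1-byte register value.
    static final int SPARSE_ENTRY_BYTES = Short.BYTES + Byte.BYTES;

    // Serialize non-zero registers as (position, value) pairs into a buffer
    // sized from a caller-supplied count that may be stale.
    static ByteBuffer toSparse(byte[] registers, int claimedNonZero) {
        ByteBuffer out = ByteBuffer.allocate(claimedNonZero * SPARSE_ENTRY_BYTES);
        for (int i = 0; i < registers.length; i++) {
            if (registers[i] != 0) {
                out.putShort((short) i); // throws once the buffer is full
                out.put(registers[i]);
            }
        }
        out.flip();
        return out;
    }

    // Returns true if serialization runs past the end of the buffer.
    static boolean overflows(byte[] registers, int claimedNonZero) {
        try {
            toSparse(registers, claimedNonZero);
            return false;
        } catch (BufferOverflowException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] registers = new byte[1024];
        Arrays.fill(registers, (byte) 0x11); // every register populated
        System.out.println(overflows(registers, 1024)); // accurate count: fits
        System.out.println(overflows(registers, 10));   // stale count: overflows
    }
}
```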
   
   
   With this PR applied, the query no longer crashes, but it does return a sketch that is useless, as demonstrated by the estimated-cardinality checks in the added unit tests.
   
   A tangential long-term solution would probably be to also seed the murmur hash with a custom value, but that would break historical compatibility in nasty ways.
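
As a rough illustration of why a seed would help, the toy below compares the largest register rank when the sketch shares the sampler's hash (seed `0`) against a sketch that perturbs its input with a seed first. Everything here is hypothetical: `mix` is a SplitMix64-style stand-in rather than Murmur3, the XOR-based `seededMix` is only one possible way to apply a seed, and the sampling rule is assumed. Because a seed changes every hash value, sketches built with it would not be mergeable with previously stored ones, which is the compatibility cost mentioned above.

```java
// Toy model only: mix() is a stand-in hash and the seed handling is hypothetical.
final class SeededHllDemo {
    // SplitMix64-style finalizer, used here as a stand-in 64-bit hash.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Hypothetical seeded variant: seed 0 reproduces the unseeded hash.
    static long seededMix(long x, long seed) {
        return mix(x ^ seed);
    }

    // Largest (leading zeros + 1) rank seen across the sampled keys.
    static int maxRank(long numKeys, long seed) {
        int max = 0;
        for (long key = 0; key < numKeys; key++) {
            if (mix(key) >= 0) {
                continue; // assumed upstream sampler: keep keys whose hash has the top bit set
            }
            long h = seededMix(key, seed);
            max = Math.max(max, Long.numberOfLeadingZeros(h) + 1);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("shared hash: " + maxRank(200_000, 0L));
        System.out.println("seeded hash: " + maxRank(200_000, 0x5EEDL));
    }
}
```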
