You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/04/17 06:18:06 UTC
[GitHub] [incubator-druid] clintropolis opened a new pull request #7496: refactor druid-bloom-filter aggregators

clintropolis opened a new pull request #7496: refactor druid-bloom-filter aggregators
URL: https://github.com/apache/incubator-druid/pull/7496
 
 
   This PR refactors the `druid-bloom-filter` extension aggregators to strictly use `ByteBuffer` representation of `BloomKFilter` and buffer based methods to manipulate them in place, allowing for combined `Aggregator` and `BufferAggregator` implementations and vastly simplifying the code. It also fixes a bug in the `combine` logic of the current implementation, resulting from the mixed use of `BloomKFilter` and `ByteBuffer` (the current implementation only can combine `BloomKFilter`, but the broker will receive `byte[]` from the historicals, resulting in a cast exception).
   
   Originally introduced in #6397, the PR sort of grew organically over its 41 commit lifetime; from starting out where it only dealt with `BloomKFilter` on heap and suffered the overhead of a lot of serde, to where it ended up, which mixed usage of `BloomKFilter` and `ByteBuffer` after adding some methods to optimize the `BufferAggregator` implementation, done in an attempt to reduce the overhead of serializing and deserializing such potentially large values so often, and picking up a bug or two along the way.
   
   This new approach combines implementations of both types of the aggregators and now _always_ operates with `ByteBuffer`. The `BloomFilterAggregatorFactory` and `BloomFilterMergeAggregatorFactory` construct the `Aggregator` and `BufferAggregator` implementations with a perhaps not greatly named boolean parameter on the constructor, `onHeap`. If `onHeap` is set to `true`, the aggregator will allocate an appropriately sized `ByteBuffer` to hold a `BloomKFilter` allow usage as an `Aggregator`, and if not must rely on the `ByteBuffer` that will be passed to it's methods during it's life as a `BufferAggregator`. 
   
   There is probably room for improvement to make this a bit more elegant, maybe making the aggregator constructors private and adding static methods to more explicitly construct either an `Aggregator` with `onHeap = true`, or a `BufferAggregator` with `onHeap = false`? Regardless, this eliminates all of the extra serde and should be a lot more simple to reason about and troubleshoot.
   
   I have tested with top-n, timeseries, and group-by queries on a small test cluster with multiple-historical spun up on my laptop, and things look good so far.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org