You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/11/07 05:26:33 UTC

[GitHub] clintropolis opened a new pull request #6584: fix druid-bloom-filter thread-safety

clintropolis opened a new pull request #6584: fix druid-bloom-filter thread-safety
URL: https://github.com/apache/incubator-druid/pull/6584
 
 
   Alternate approach to what was taken in #6547, instead embedding the `BloomKFilter` implementation in `druid-bloom-filters` extension, and wrapping variables which are re-used instead of re-allocated per function call with `ThreadLocal` to fix the thread safety issues. 
   
   Fixes #6546
   
   The copied `BloomKFilter` is from `hive-storage-api` as of verison 2.7.0 instead of master, since [this commit](https://github.com/apache/hive/commit/87ce36b458350db141c4cb4b6336a9a01796370f#diff-e65fc506757ee058dc951d15a9a526c3L238) seems to have broken backwards compatibility by using 8 bytes instead of 4 to store and test ints/floats. I went with this version specifically to maintain compatibility with the hive version used in the original PR, #6222, despite I think the latest implementation being a bit better since it doesn't use the static variable. The dependency on `hive-storage-api` still exists for use of it's `Murmur3` implementation, though if we embedded that as well we could drop the dependency entirely.
   
   Longer term I think this could prove advantageous having control over the implementation, since it will allow me to implement `ByteBuffer` versions of methods to make #6397 more efficient. 
   
   I collected benchmarks of this implementation of `BloomKFilter` and compared it to `BloomFilter` and they are fairly close, with a slight edge to `BloomKFilter` with larger filters:
   ```
   Benchmark                                     (capacity)   (rows)  Mode  Cnt    Score     Error  Units
   BloomKFilterBenchmark.testStringBloomFilter      1000000  3000000  avgt   20  419.952 ±  14.495  ms/op
   BloomKFilterBenchmark.testStringBloomFilter     10000000  3000000  avgt   20  632.029 ±  91.746  ms/op
   BloomKFilterBenchmark.testStringBloomFilter    100000000  3000000  avgt   20  829.343 ± 214.949  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter     1000000  3000000  avgt   20  430.807 ±  17.243  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter    10000000  3000000  avgt   20  617.019 ± 171.415  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter   100000000  3000000  avgt   20  862.188 ± 270.329  ms/op
   ```
   
   ```
   Benchmark                                     (capacity)    (rows)  Mode  Cnt     Score     Error  Units
   BloomKFilterBenchmark.testStringBloomFilter      1000000  10000000  avgt   20  1577.867 ±  35.568  ms/op
   BloomKFilterBenchmark.testStringBloomFilter     10000000  10000000  avgt   20  2220.278 ±  71.004  ms/op
   BloomKFilterBenchmark.testStringBloomFilter    100000000  10000000  avgt   20  2766.451 ± 496.881  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter     1000000  10000000  avgt   20  1680.831 ±  39.377  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter    10000000  10000000  avgt   20  1983.740 ±  70.657  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter   100000000  10000000  avgt   20  2428.820 ± 480.301  ms/op
   ```
   
   and for reference here is the unmodified `BloomKFilter`, which has similar test speeds as well
   ```
   Benchmark                                     (capacity)    (rows)  Mode  Cnt     Score     Error  Units
   BloomKFilterBenchmark.testStringBloomFilter      1000000  10000000  avgt   20  1599.278 ±  70.931  ms/op
   BloomKFilterBenchmark.testStringBloomFilter     10000000  10000000  avgt   20  2375.007 ± 149.834  ms/op
   BloomKFilterBenchmark.testStringBloomFilter    100000000  10000000  avgt   20  2639.661 ±  84.199  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter     1000000  10000000  avgt   20  1673.166 ±  48.810  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter    10000000  10000000  avgt   20  2045.913 ±  33.302  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter   100000000  10000000  avgt   20  2289.513 ±  44.742  ms/op
   ```
   
   I am going to test these changes on a test cluster, to ensure they function as expected and correctly resolve #6546, and will add a comment with the results here before we should consider merging.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org