Posted to dev@pinot.apache.org by GitBox <gi...@apache.org> on 2018/11/20 19:51:29 UTC

[GitHub] kishoreg opened a new pull request #3528: Adding support for bloom filter

kishoreg opened a new pull request #3528: Adding support for bloom filter
URL: https://github.com/apache/incubator-pinot/pull/3528
 
 
   Functional, but it still needs cleanup and test cases (WIP).
   Bloom filters can be very effective at pruning segments. This PR generates the bloom filter dynamically for the columns listed under tableconfig->indexingconfig->bloomFilterColumns.
   It also enhances the ColumnValueSegmentPruner to apply the bloom filter if one exists.
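
As a rough illustration of the pruning idea (not the code in this PR), the sketch below skips a segment when its bloom filter says the queried value is definitely absent. Every name here (`BloomPruneSketch`, `SimpleBloomFilter`, `Segment`, `selectSegments`) is hypothetical, and the hand-rolled filter stands in for whichever bloom filter library is actually used.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; none of these types are Pinot classes.
final class BloomPruneSketch {

  /** Minimal on-heap bloom filter: k bit positions derived by double hashing. */
  static final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    void add(String value) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(value, i));
      }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String value) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(value, i))) {
          return false;
        }
      }
      return true;
    }

    private int index(String value, int i) {
      byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
      int h1 = fnv1a(bytes, 0x811C9DC5);
      int h2 = fnv1a(bytes, 0x01000193) | 1; // keep the step odd
      return Math.floorMod(h1 + i * h2, numBits);
    }

    private static int fnv1a(byte[] data, int seed) {
      int h = seed;
      for (byte b : data) {
        h ^= (b & 0xFF);
        h *= 16777619;
      }
      return h;
    }
  }

  /** A segment with an optional bloom filter on the filtered column. */
  static final class Segment {
    final String name;
    final SimpleBloomFilter filter; // null when no bloom filter was built

    Segment(String name, SimpleBloomFilter filter) {
      this.name = name;
      this.filter = filter;
    }
  }

  /** Returns the names of segments that still need processing for "column = value". */
  static List<String> selectSegments(List<Segment> segments, String value) {
    List<String> selected = new ArrayList<>();
    for (Segment segment : segments) {
      // Without a filter we cannot rule the segment out, so it is kept.
      if (segment.filter == null || segment.filter.mightContain(value)) {
        selected.add(segment.name);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    SimpleBloomFilter withValue = new SimpleBloomFilter(1 << 16, 3);
    withValue.add("user42");
    SimpleBloomFilter empty = new SimpleBloomFilter(1 << 16, 3);
    List<Segment> segments = Arrays.asList(
        new Segment("seg_with_value", withValue),
        new Segment("seg_empty_filter", empty),
        new Segment("seg_no_filter", null));
    System.out.println(selectSegments(segments, "user42"));
  }
}
```

A bloom filter never reports a false negative, so this kind of pruning cannot drop a segment that contains the value; false positives only mean that some segments are processed unnecessarily.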
   
   Sample stats
   Without bloom filter
       "numSegmentsProcessed": 136,
       "numSegmentsWithNoMatch": 128
   With bloom filter
       "numSegmentsProcessed": 14,
       "numSegmentsWithNoMatch": 6
   The number of segments processed drops from 136 to 14, and the number of processed segments with no match drops from 128 to 6.
   
   This, of course, comes with the additional overhead of creating and evaluating the bloom filter. The current implementation loads the bloom filter on heap, and it can be quite big: for example, bloom filters for real-time segments range from 300 to 700 KB.
   
   I am considering two options:
   1. Limit the size of the bloom filter to 1 MB and sacrifice accuracy.
   2. Implement the bloom filter off-heap.
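
To make the accuracy trade-off of a size cap concrete, here is a small sketch based on the standard bloom filter sizing formulas (m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions). The class and method names, the 1 MB constant, and the example cardinality are assumptions for illustration, not values taken from this PR.

```java
// Hypothetical sketch of how a size cap affects the false-positive rate.
final class BloomSizingSketch {
  static final long MAX_BITS = 8L * 1024 * 1024; // 1 MB expressed in bits

  /** Optimal number of bits for n entries at target false-positive rate p. */
  static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  /** Optimal bit count, clamped to the size cap. */
  static long cappedBits(long n, double p) {
    return Math.min(optimalBits(n, p), MAX_BITS);
  }

  /** Number of hash functions that minimizes false positives for m bits and n entries. */
  static int optimalHashes(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  /** Approximate false-positive rate for n entries, m bits, k hashes. */
  static double expectedFpp(long n, long m, int k) {
    return Math.pow(1 - Math.exp(-(double) k * n / m), k);
  }

  public static void main(String[] args) {
    long n = 1_000_000L;   // assumed column cardinality
    double target = 0.01;  // assumed target false-positive rate
    long m = cappedBits(n, target);
    int k = optimalHashes(n, m);
    System.out.printf("bits=%d hashes=%d expectedFpp=%.4f%n", m, k, expectedFpp(n, m, k));
  }
}
```

Once the cap is hit, the false-positive rate rises with cardinality, so pruning becomes less effective on high-cardinality columns but never incorrect.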
   
   For now, we will start with option 1 and add support for option 2 later. That should not be hard; we just need an off-heap bitset hooked into the bloom filter.
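
One possible shape for such an off-heap bitset is sketched below, backed by a direct `ByteBuffer` so the bits live outside the Java heap. The class name `OffHeapBitSetSketch` is hypothetical, and how it would plug into a given bloom filter library is left open.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a fixed-size bitset stored in off-heap (direct) memory.
final class OffHeapBitSetSketch {
  private final ByteBuffer buffer;
  private final long numBits;

  OffHeapBitSetSketch(long numBits) {
    if (numBits <= 0 || numBits > (long) Integer.MAX_VALUE * 8) {
      throw new IllegalArgumentException("unsupported size: " + numBits);
    }
    this.numBits = numBits;
    // Direct buffers are allocated outside the heap and start zero-filled.
    this.buffer = ByteBuffer.allocateDirect((int) ((numBits + 7) / 8));
  }

  void set(long bitIndex) {
    checkIndex(bitIndex);
    int byteIndex = (int) (bitIndex >>> 3);
    int mask = 1 << (int) (bitIndex & 7);
    buffer.put(byteIndex, (byte) (buffer.get(byteIndex) | mask));
  }

  boolean get(long bitIndex) {
    checkIndex(bitIndex);
    int byteIndex = (int) (bitIndex >>> 3);
    int mask = 1 << (int) (bitIndex & 7);
    return (buffer.get(byteIndex) & mask) != 0;
  }

  long size() {
    return numBits;
  }

  private void checkIndex(long bitIndex) {
    if (bitIndex < 0 || bitIndex >= numBits) {
      throw new IndexOutOfBoundsException("bit index: " + bitIndex);
    }
  }

  public static void main(String[] args) {
    OffHeapBitSetSketch bits = new OffHeapBitSetSketch(1024);
    bits.set(7);
    System.out.println(bits.get(7) + " " + bits.get(8));
  }
}
```

This sketch is not thread-safe; concurrent writers would need external synchronization or a build-then-publish pattern in which the filter is read-only after segment creation.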
   
   I compared the ClearSpring bloom filter with Guava's (for String and Integer values). Guava was slightly better in terms of size, but the ClearSpring API was much simpler. I will try to add some additional metadata to the serialized data so that we can switch the format later.
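
A minimal sketch of what such metadata could look like: a small header carrying a format version and an implementation type ID in front of the library-specific bytes. The layout, the type IDs, and the class name are assumptions for illustration; the PR does not specify a format.

```java
import java.nio.ByteBuffer;

// Hypothetical envelope: [int version][byte typeId][int payloadLength][payload bytes].
final class BloomFilterEnvelopeSketch {
  static final int VERSION = 1;
  static final byte TYPE_CLEARSPRING = 1; // assumed type IDs, for illustration only
  static final byte TYPE_GUAVA = 2;

  /** Decoded header fields plus the untouched library-specific payload. */
  static final class Envelope {
    final int version;
    final byte typeId;
    final byte[] payload;

    Envelope(int version, byte typeId, byte[] payload) {
      this.version = version;
      this.typeId = typeId;
      this.payload = payload;
    }
  }

  /** Prefixes the serialized filter with the version and type header. */
  static byte[] wrap(byte typeId, byte[] payload) {
    return ByteBuffer.allocate(4 + 1 + 4 + payload.length)
        .putInt(VERSION)
        .put(typeId)
        .putInt(payload.length)
        .put(payload)
        .array();
  }

  /** Reads the header so the caller can pick the matching deserializer. */
  static Envelope unwrap(byte[] bytes) {
    ByteBuffer in = ByteBuffer.wrap(bytes);
    int version = in.getInt();
    if (version != VERSION) {
      throw new IllegalArgumentException("unsupported version: " + version);
    }
    byte typeId = in.get();
    int length = in.getInt();
    if (length < 0 || length > in.remaining()) {
      throw new IllegalArgumentException("corrupt payload length: " + length);
    }
    byte[] payload = new byte[length];
    in.get(payload);
    return new Envelope(version, typeId, payload);
  }

  public static void main(String[] args) {
    byte[] wrapped = wrap(TYPE_CLEARSPRING, new byte[] {1, 2, 3});
    Envelope envelope = unwrap(wrapped);
    System.out.println(envelope.typeId + " " + envelope.payload.length);
  }
}
```

With a header like this, a reader can dispatch on the type ID, so segments written with one library remain readable after a later switch to another.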
   
   
   
