You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2019/10/17 07:37:47 UTC

[GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial bloom filter code contribution

Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: https://github.com/apache/commons-collections/pull/83#issuecomment-543048050
 
 
   If you were not to build a bloom filter what would you do to check if
   something was in the filter?
   
   I am going to guess at an answer here.
   
   1) create a bit array
   2) do some hashing of the object and turn on bits in the array
   3) execute [ array & filter == array ] test to see if it was in there.
   
   In essence you build a bloom filter.  These same steps are done when you
   create a bloom filter instance in the class
   
   So yes you have to build a bloom filter but the overhead is virtually
   non-existent as using the builder pattern removes/replaces steps from your
   process:
   
   Assumeing partA partB and partC of the object are to be hashed, the steps
   using the library are:
   
   // assuming 10K entries in the filter and 1 in 50K collision rate.  This
   only needs to be build once.
   BloomFilterConfiguration config = new BloomFIlterConfiguration( 10000,
   1.0/50000 );/
   
   BloomFilter bf = new StandardBloomFilter( ProtoBloomFilter.builder().with(
   obj.partA ).with( obj.partB ).with( obj.partC).build(), config );
   
   So the over head for comparisson is minimal.
   
   The filters are immutable because if you pass them as parameters you want
   to ensure that they are not modified by external operations (either
   intentionally or through error).  In most use cases once the filter is
   constructed it is used in one or more comparisons.  The only time a filter
   is modified is when another filter is added to the collection.  In my
   experience the rate at which you add filters to collections is much lower
   than the rate at which you compare them.
   
   The other reason is that the equality and the hashCode of a filter are
   based on the bits that are on.  So leaving them mutable would mean that the
   filters could not be used in a Hash based Set, because the hash would
   change after insertion when a new object was added.
   
   Claude
   Claude
   
   On Thu, Oct 17, 2019 at 2:20 AM SilverNarcissus <no...@github.com>
   wrote:
   
   > Hi~ Thanks for your work! I have a question about you use immutable bloom
   > filter and if we want to judge whether an object is in a bloom filter, we
   > have to build a new bloom filter and compare them. There may be a
   > performance impact. How come you choose this design?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/commons-collections/pull/83?email_source=notifications&email_token=AASTVHUGX56CGN73LMHN2IDQO64WXA5CNFSM4IWGTGA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBONO2A#issuecomment-542955368>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AASTVHSGNND3VDZLN72RDBLQO64WXANCNFSM4IWGTGAQ>
   > .
   >
   
   
   -- 
   I like: Like Like - The likeliest place on the web
   <http://like-like.xenei.com>
   LinkedIn: http://www.linkedin.com/in/claudewarren
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services