You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2019/10/17 07:37:47 UTC
[GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial
bloom filter code contribution
Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: https://github.com/apache/commons-collections/pull/83#issuecomment-543048050
If you were not to build a bloom filter what would you do to check if
something was in the filter?
I am going to guess at an answer here.
1) create a bit array
2) do some hashing of the object and turn on bits in the array
3) execute [ array & filter == array ] test to see if it was in there.
In essence you build a bloom filter. These same steps are done when you
create a bloom filter instance in the class
So yes you have to build a bloom filter but the overhead is virtually
non-existent as using the builder pattern removes/replaces steps from your
process:
Assumeing partA partB and partC of the object are to be hashed, the steps
using the library are:
// assuming 10K entries in the filter and 1 in 50K collision rate. This
only needs to be build once.
BloomFilterConfiguration config = new BloomFIlterConfiguration( 10000,
1.0/50000 );/
BloomFilter bf = new StandardBloomFilter( ProtoBloomFilter.builder().with(
obj.partA ).with( obj.partB ).with( obj.partC).build(), config );
So the over head for comparisson is minimal.
The filters are immutable because if you pass them as parameters you want
to ensure that they are not modified by external operations (either
intentionally or through error). In most use cases once the filter is
constructed it is used in one or more comparisons. The only time a filter
is modified is when another filter is added to the collection. In my
experience the rate at which you add filters to collections is much lower
than the rate at which you compare them.
The other reason is that the equality and the hashCode of a filter are
based on the bits that are on. So leaving them mutable would mean that the
filters could not be used in a Hash based Set, because the hash would
change after insertion when a new object was added.
Claude
Claude
On Thu, Oct 17, 2019 at 2:20 AM SilverNarcissus <no...@github.com>
wrote:
> Hi~ Thanks for your work! I have a question about you use immutable bloom
> filter and if we want to judge whether an object is in a bloom filter, we
> have to build a new bloom filter and compare them. There may be a
> performance impact. How come you choose this design?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/commons-collections/pull/83?email_source=notifications&email_token=AASTVHUGX56CGN73LMHN2IDQO64WXA5CNFSM4IWGTGA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBONO2A#issuecomment-542955368>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AASTVHSGNND3VDZLN72RDBLQO64WXANCNFSM4IWGTGAQ>
> .
>
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services