You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Claude Warren (Jira)" <ji...@apache.org> on 2019/11/01 10:10:00 UTC

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

    [ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964753#comment-16964753 ] 

Claude Warren commented on COLLECTIONS-728:
-------------------------------------------

=== From the conversation above ===

 Why doesn't Shape keep a reference to Hasher? That would avoid having to pass both to e.g. method contains in BloomFilter (and would make is unnecessary to check their consistency).

 (see the response to how are StaticHasher and DynamicHasher different for some extra background).
 The contains method with the Hasher was requested because one of the other Apache projects was interested in checking the presence of the Bloom filter without building a complete Bloom filter. They have the hasher and they know the shape it should be. So the method actually needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a StaticHasher (which does know it's shape), or they could implement another Hasher type.

Placing the Hasher in the Shape would mean that the shape will always return the same data when what we are doing is trying to send different data with the same shape.

Still it doesn't look right, API-wise; but again I'm lacking a clear view of all the requirements to provide an alternative. 

=== end of quote ===

The methods that have both the Hasher and the Shape are really passing decomposed Bloom filters.  The Hasher contains the hash values and the Shape identifies shape of the decomposed Bloom filter.   True the Hasher could be limited to a StaticHasher and the Shape would not be needed as the Static Hahser contains the Shape it was built with, but in the general case a Hasher does not have a Shape. 

The reason for the separation is the Hasher applies a hash function to data and returns int values.
The shape indicates to the Hasher how many times it should apply the hashing algorithm (k in the maths) to each data item being hashes (item = the X in DynamicHasher.with(X)).  The Shape also give the upper limit for the values the hasher returns on the particular getBits() call.  The upper limit is "m" in the maths.

So Hasher + Shape = an iteration of integers that are the bits turned on the Bloom filter.

A single hasher can generated Bloom filters of an almost infinite number of Shapes.

--- snip --

=== start of quote === 

From KISS and DRY POVs, I disagree with both statements: The factory could be immutable and readily provide all the useful algorithms (see e.g. RandomSource).

=== end of quote ===

There are two issues with the idea that we can provide all the usefull algorithms.

1) We will always be behind when it comes to implementing new algorithms using new hash functions.  It is better to allow the users in the field to add algorithms as they see fit.  I remember an instructor one said the the only valid limits in computer science are zero, one and many.  Providing an implementation with some but not allowing for addition of others is very limiting.

2) There are known cases (Cassandra comes to mind) where an implementation of a Hashing algorithm was incorrectly executed and released into the wild.  If such an application wanted to use this library they would be unable to add their "broken" hashing to the factory would be an inhibitor.

So why not provide a factory that " provide all the useful algorithms" and has the convenience methods to add hash functions?

Once we add hash functions it seems a short step to resetting the factory to default.  At a minimum this would be needed for testing.









> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)