You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2019/11/03 15:22:49 UTC

[GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial bloom filter code contribution

Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: https://github.com/apache/commons-collections/pull/83#issuecomment-549147466

I built a test suite to see how fast the various options are. I used the GeoNames database as a datasource and built hashed the names of the first 10^6 entries.
I took a sample comprising every 1000th entry.

I ran populations of 10^2 thru 10^6 entries stored in the bloom filter.
The filter was defined as having 10^4 entries with a collision probability of 5x10^4.
An MD5 hash was used.

All hashes were created outside the timing loops.
The values we are looking for are called "target" below. The filter we are comparing with is called "candidate" below.

There were 4 tests:
**Int** - Where an array of integers representing the ON bits for the target where checked against a bitSet built from the candidate. Only the comparison was timed, bitset and array of ints were created outside of the timing loop. Duplicate values were removed from the target array.
**long** - Where arrays of longs representing the ON bits for the target were compared against an array of longs for the candidate. Only the comparison was timed, long arrays were created outside of the timing loop.
**hasher** - Each target was encoded into a StaticHasher (basically a set of integers) and then the hasher was compared with the candidate Bloom filter using the BloomFilter.contains( Hasher ) method.
**filter** - Each target Bloom filter was constructed using the BitSetBloomFilter and then the BloomFilter.contains( BloomFilter ) was called on the candidate. The construction of the target Bloom filter was outside the timing loop.

Each test used the 1000 samples 5 times and an average taken. The result is that The int and the long checks are so fast that they rarely take 1 millisecond. The hasher check is slightly slower but still less than 1 millisecond on average. The full filter creation is slower but never took more than 94 milliseconds on average.

I have attached csv files of the data from the runs and the summary in a zip file. If you want the test code please let me know.

[speedTest.zip](https://github.com/apache/commons-collections/files/3801701/speedTest.zip)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services