You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Alex Herbert (Jira)" <ji...@apache.org> on 2022/07/08 11:20:00 UTC
[jira] [Commented] (COLLECTIONS-824) BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change
[ https://issues.apache.org/jira/browse/COLLECTIONS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564240#comment-17564240 ]
Alex Herbert commented on COLLECTIONS-824:
------------------------------------------
There is an example of how to perform enhanced double hashing in the [Wikipedia article|https://en.wikipedia.org/wiki/Double_hashing#Enhanced_double_hashing].
The following code is for enhanced double hashing. It requires two checks for overflow per loop iteration:
{code:java}
public boolean forEachIndex(IntPredicate consumer) {
Objects.requireNonNull(consumer, "consumer");
final int bits = shape.getNumberOfBits();
// Enhanced double hashing:
// hash[i] = ( h1(x) + i*h2(x) + (i*i*i - i)/6 ) mod bits
// See: https://en.wikipedia.org/wiki/Double_hashing#Enhanced_double_hashing
//
// Essentially this is computing a wrapped modulus from a start point and an
// increment and an additional term as a tetrahedral number.
// You only need two modulus operations before the loop. Within the loop
// the modulus is handled using the sign bit to detect wrapping to ensure:
// 0 <= index < bits
// 0 <= inc < bits
// The final hash is:
// hash[i] = ( h1(x) - i*h2(x) - (i*i*i - i)/6 ) wrapped in [0, bits)
int index = mod(initial, bits);
int inc = mod(increment, bits);
final int k = shape.getNumberOfHashFunctions();
for (int i = 0; i < k; i++) {
if (!consumer.test(index)) {
return false;
}
// Update index and handle wrapping
index -= inc;
index = index < 0 ? index + bits : index;
// Incorporate the counter into the increment to create a
// tetrahedral number additional term, and handle wrapping.
// NOTE: This assumes k <= bits
inc -= i;
inc = inc < 0 ? inc + bits : inc;
}
return true;
}
{code}
There is a big assumption is the above code that {{{}i < k <= bits{}}}. This may not always be true: if the shape has been created badly it is possible to have more hash functions (k) than bits. The resulting filter would fill up very fast. Assuming a few duplicate indices per item it will take only a few items to saturate the bits.
In this case the counter {{i}} can cause the increment to be more than {{bits}} below zero and the code requires branching and a loop:
{code:java}
inc -= i;
if (inc < 0) {
do {
inc += bits;
} while (inc < 0);
}{code}
In this case it may be better to avoid branching within the main loop using a double loop with the inner loop wrapping up to bits for each outer iteration:
{code:java}
final int k = shape.getNumberOfHashFunctions();
for (int j = k; j > 0;) {
final int block = Math.min(j, bits);
j -= block;
for (int i = 0; i < block; i++) {
if (!consumer.test(index)) {
return false;
}
// Update index and handle wrapping
index -= inc;
index = index < 0 ? index + bits : index;
// Incorporate the counter into the increment to create a
// tetrahedral number additional term, and handle wrapping.
inc -= i;
inc = inc < 0 ? inc + bits : inc;
}
}
{code}
Since this is an edge case it can be returned conditionally based on the Shape:
{code:java}
@Override
public IndexProducer indices(final Shape shape) {
Objects.requireNonNull(shape, "shape");
final int bits = shape.getNumberOfBits();
final int k = shape.getNumberOfHashFunctions();
if (k > bits) {
// Edge case
return new IndexProducer() {
// ...
};
}
// Main code path ...
return new IndexProducer() {
// ...
};
}
{code}
The resulting hasher has a lot of duplicate code, especially when the same code is required for the {{uniqueIndices}} method. An alternative is to throw an exception when {{k > bits}} as this is most likely a mistake.
Note the simple hasher can be updated as:
{code:java}
public boolean forEachIndex(IntPredicate consumer) {
Objects.requireNonNull(consumer, "consumer");
final int bits = shape.getNumberOfBits();
// Must handle bits == 1 to avoid (bits - 1) == 0
if (bits == 1) {
// No need to send k duplicates?
// for (int i = k; i != 0; i--) {
// if (!consumer.test(0)) {
// return false;
// }
// }
// return true;
return consumer.test(0);
}
// Double hashing:
// hash[i] = ( h1(x) + i*h2(x) ) mod bits
// See: https://en.wikipedia.org/wiki/Double_hashing
//
// Essentially this is computing a wrapped modulus from a start point and an
// increment. You only need two modulus operations before the loop. Within the loop
// the modulus is handled using the sign bit to detect wrapping to ensure:
// 0 <= index < bits
// 1 <= inc < bits
// The final hash is:
// hash[i] = ( h1(x) - i*h2(x) ) wrapped in [0, bits)
int index = mod(initial, bits);
// Ensure non-zero increment (Note: bits > 1)
int inc = 1 + mod(increment, bits - 1);
final int k = shape.getNumberOfHashFunctions();
for (int i = 0; i < k; i++) {
if (!consumer.test(index)) {
return false;
}
// Update index and handle wrapping
index -= inc;
index = index < 0 ? index + bits : index;
}
return true;
}
{code}
In this case the simple double hasher removes the requirement for a default increment used in the current code.
> BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change
> ----------------------------------------------------------------------------
>
> Key: COLLECTIONS-824
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-824
> Project: Commons Collections
> Issue Type: Improvement
> Components: Collection
> Affects Versions: 4.5
> Reporter: Claude Warren
> Assignee: Claude Warren
> Priority: Major
> Labels: bloom-filter
>
> {noformat}
> public boolean forEachIndex(IntPredicate consumer) {
> Objects.requireNonNull(consumer, "consumer");
> int bits = shape.getNumberOfBits();
> /*
> * Essentially this is computing a wrapped modulus from a start point and an
> * increment. So actually you only need two modulus operations before the loop.
> * This avoids any modulus operation inside the while loop. It uses a long index
> * to avoid overflow.
> */
> long index = mod(initial, bits);
> int inc = mod(increment, bits);
> for (int functionalCount = 0; functionalCount < shape.getNumberOfHashFunctions(); functionalCount++) {
> if (!consumer.test((int) index)) {
> return false;
> }
> index += inc;
> index = index >= bits ? index - bits : index;
> }
> return true;
> }
> {noformat}
> This can be fixed to be non zero by using a smaller modulus:
> {noformat}
> // Ensure non-zero increment
> int inc = 1 + mod(increment, bits - 1);
> {noformat}
> This will prevent the 1/bits occurrence that the increment is zero and the hash is poor (it will be a single index).
> If bits == 1 then this will throw an exception. However if bits==1 this is an edge case that can be handled immediately after obtaining the number of bits as:
> {noformat}
> if (bits == 1) {
> // No need to send k duplicates, the filter shape has been constructed incorrectly.
> return consumer.test(0);
> }
> {noformat}
> https://github.com/Claudenw/commons-collections/blob/9f2945cc98747893456b73f42ab53f46a866ac37/src/main/java/org/apache/commons/collections4/bloomfilter/SimpleHasher.java#L144
> Here we have index as a long. However if the increment (which is always positive) is subtracted we can use an int and compare to zero:
> {noformat}
> index -= inc;
> index = index < 0 ? index + bits : index;
> {noformat}
> This does break a lot of unit tests. These tests have been created using SimpleHasher knowing what the hash modulus will be. The tests should be updated to use a FixedHasher in the test package:
> {noformat}
> /**
> * To be used for testing only. The fixed indices must fit within the filter shape.
> */
> class FixedHasher implements Hasher {
> private final int[] indices;
> private int[] distinct;
> FixedHasher(int... indices) {
> this.indices = indices.clone();
> }
> @Override
> public IndexProducer indices(Shape shape) {
> checkIndices(shape);
> return IndexProducer.fromIndexArray(indices);
> }
> @Override
> public IndexProducer uniqueIndices(Shape shape) {
> checkIndices(shape);
> int[] d = distinct;
> if (d == null) {
> distinct = d = Arrays.stream(indices).distinct().toArray();
> }
> return IndexProducer.fromIndexArray(d);
> }
> private void checkIndices(Shape shape) {
> // Check the hasher is OK for the shape
> int bits = shape.getNumberOfBits();
> for (int i : indices) {
> Assertions.assertTrue(i < bits);
> }
> }
> }
> {noformat}
> This hasher will allow filters to be constructed and manipulated. Or change it to use % bits and drop the assertion in checkIndices.
> The SimpleHasher should just be tested that it outputs random indices. I would suggest creating a few thousand randomly seeded hashers, outputting 5 indices from each and building a histogram of the counts of each index. This should be uniform when tested with a Chi-square test. This can be repeated for a few different shape sizes.
> Updating the filter in this way will still implement a Krisch and Mitzenmacher hasher where:
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m
> {noformat}
> The only difference is the subtraction rather than addition.
> The wikipedia article on [Double Hashing (variants)|https://en.wikipedia.org/wiki/Double_hashing#Variants] notes that this type of hasher when used in Bloom filters will result in a collision rate that is twice as likely as intended.
> An alternative is to provide enhanced double hashing which changes the increment each loop iteration. This would require some additional work on wrapping the modulus of the increment.
> It may be better to change the name to KMHasher and more explicitly state that the hash functions are:
>
> {noformat}
> g_i(x) = (h1(x) + i h2(x)) mod m; i in [1, k]
> {noformat}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)