
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2020/03/06 01:44:48 UTC

[GitHub] [commons-collections] aherbert opened a new pull request #137: WIP: CountingBloomFilter

aherbert opened a new pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137
 
 
   This renames the existing CountingBloomFilter to MapCountingBloomFilter. This makes way for the CountingBloomFilter to be created as an interface to extend BloomFilter.
   
   An implementation using a backing array has been added: ArrayCountingBloomFilter.
   
   The now renamed MapCountingBloomFilter has not been updated. It must either be changed to implement the CountingBloomFilter interface or be removed from the API.
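   
   The Javadoc for the `state` field in the diff describes a validity flag that is maintained without branches. The following is only a minimal sketch of that idea, not the code in this pull request: the class `BranchFreeCounts` and its method names are hypothetical, and all Bloom filter logic (shape, hashers) is left out.
   
   ```java
   /**
    * Hypothetical, simplified counter array illustrating a branch-free
    * validity flag. This is an assumption-based sketch, not the PR's class.
    */
   final class BranchFreeCounts {
       /** The count stored at each index. */
       private final int[] counts;
   
       /**
        * Bitwise OR of every count value ever written. The sign bit is set,
        * and stays set, once any written count is negative. An overflow of
        * Integer.MAX_VALUE + 1 wraps to a negative value, so it is caught too.
        */
       private int state;
   
       BranchFreeCounts(final int size) {
           counts = new int[size];
       }
   
       void increment(final int idx) {
           final int updated = counts[idx] + 1;
           state |= updated; // no conditional check for overflow
           counts[idx] = updated;
       }
   
       void decrement(final int idx) {
           final int updated = counts[idx] - 1;
           state |= updated; // no conditional check for a negative count
           counts[idx] = updated;
       }
   
       /** Returns false once any update has produced a negative count. */
       boolean isValid() {
           return state >= 0;
       }
   
       int count(final int idx) {
           return counts[idx];
       }
   }
   ```
   
   Because the counts are not clamped, applying the complementary operation restores the previous counts, while `isValid()` continues to return false.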
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-collections] aherbert commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

aherbert commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389322748
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.NoSuchElementException;
+import java.util.PrimitiveIterator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.function.Consumer;
+import java.util.function.IntConsumer;
+import java.util.Set;
+
+import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
+import org.apache.commons.collections4.bloomfilter.hasher.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;
+
+/**
+ * A counting Bloom filter using an array to track counts for each enabled bit
+ * index.
+ *
+ * <p>Any operation that results in negative counts or integer overflow of counts will
+ * mark this filter as invalid. This transition is not reversible. The counts for the
+ * filter immediately prior to the operation that created invalid counts can be recovered.
+ * See the documentation in {@link #isValid()} for details.
+ *
+ * <p>All the operations in the filter assume the counts are currently valid. Behaviour
+ * of an invalid filter is undefined. It will no longer function identically to a standard
+ * Bloom filter that is the merge of all the Bloom filters that have been added
+ * to and not later subtracted from the counting Bloom filter.
+ *
+ * <p>The maximum supported number of items that can be stored in the filter is
+ * limited by the maximum array size combined with the {@link Shape}. For
+ * example an implementation using a {@link Shape} with a false-positive
+ * probability of 1e-6 and {@link Integer#MAX_VALUE} bits can reversibly store
+ * approximately 75 million items using 20 hash functions per item with a memory
+ * consumption of approximately 8 GB.
+ *
+ * @since 4.5
+ * @see Shape
+ */
+public class ArrayCountingBloomFilter extends AbstractBloomFilter implements CountingBloomFilter {
+
+    /**
+     * The count of each bit index in the filter.
+     */
+    private final int[] counts;
+
+    /**
+     * The state flag. This is a bitwise OR of the entire history of all updated
+     * counts. If negative then a negative count or integer overflow has occurred on
+     * one or more counts in the history of the filter and the state is invalid.
+     *
+     * <p>Maintenance of this state flag is branch-free for improved performance. It
+     * eliminates a conditional check for a negative count during remove/subtract
+     * operations and a conditional check for integer overflow during merge/add
+     * operations.
+     *
+     * <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
+     * that overflows indicates that the number of items in the filter exceeds the
+     * maximum possible size (number of bits) of any Bloom filter constrained by
+     * integer indices. At this point the filter is most likely full (all bits are
+     * non-zero) and thus useless.
+     *
+     * <p>Negative counts are a concern if the filter is used incorrectly by
+     * removing an item that was never added. It is expected that a user of a
+     * counting Bloom filter will not perform this action as it is a mistake.
+     * Enabling an explicit recovery path for negative or overflow counts is a major
+     * performance burden not deemed necessary for the unlikely scenarios when an
+     * invalid state is created. Maintenance of the state flag is a concession to
+     * flag improper use that should not have a major performance impact.
+     */
+    private int state;
+
+    /**
+     * An iterator of all indexes with non-zero counts.
+     *
+     * <p>In the event that the filter state is invalid any index with a negative count
+     * will also be produced by the iterator.
+     */
+    private class IndexIterator implements PrimitiveIterator.OfInt {
+        /** The next non-zero index (or counts.length). */
+        private int next;
+
+        /**
+         * Create an instance.
+         */
+        IndexIterator() {
+            advance();
+        }
+
+        /**
+         * Advance to the next non-zero index.
+         */
+        void advance() {
+            while (next < counts.length && counts[next] == 0) {
+                next++;
+            }
+        }
+
+        @Override
+        public boolean hasNext() {
+            return next < counts.length;
+        }
+
+        @Override
+        public int nextInt() {
+            if (hasNext()) {
+                final int result = next++;
+                advance();
+                return result;
+            }
+            // Currently unreachable as the iterator is only used by
+            // the StaticHasher which iterates correctly.
+            throw new NoSuchElementException();
+        }
+    }
+
+    /**
+     * Constructs an empty counting Bloom filter with the specified shape.
+     *
+     * @param shape the shape of the filter
+     */
+    public ArrayCountingBloomFilter(final Shape shape) {
+        super(shape);
+        counts = new int[shape.getNumberOfBits()];
+    }
+
+    /**
+     * Constructs a counting Bloom filter from a hasher and a shape.
+     *
+     * <p>The filter will be equal to the result of merging the hasher with an empty
+     * filter; specifically duplicate indexes in the hasher are ignored.
+     *
+     * @param hasher the hasher to build the filter from
+     * @param shape the shape of the filter
+     * @throws IllegalArgumentException if the hasher cannot generate indices for
+     * the shape
+     * @see #merge(Hasher)
+     */
+    public ArrayCountingBloomFilter(final Hasher hasher, final Shape shape) {
+        super(shape);
+        // Given the filter is empty we can optimise the operation of merge(hasher)
+        verifyHasher(hasher);
+        // Delay array allocation until after hasher is verified
+        counts = new int[shape.getNumberOfBits()];
+        // All counts are zero. Ignore duplicates by initialising to 1
+        hasher.getBits(shape).forEachRemaining((IntConsumer) idx -> counts[idx] = 1);
+    }
+
+    @Override
+    public int cardinality() {
+        int size = 0;
+        for (final int c : counts) {
+            if (c != 0) {
+                size++;
+            }
+        }
+        return size;
+    }
+
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
+    }
+
+    @Override
+    public boolean contains(final Hasher hasher) {
+        verifyHasher(hasher);
+        return contains(hasher.getBits(getShape()));
+    }
+
+    /**
+     * Returns true if this filter has non-zero counts for each index in the iterator.
+     *
+     * @param iter the iterator
+     * @return true if this filter contains all the indexes
+     */
+    private boolean contains(final OfInt iter) {
+        while (iter.hasNext()) {
+            if (counts[iter.nextInt()] == 0) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    @Override
+    public long[] getBits() {
+        final BitSet bs = new BitSet();
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                bs.set(i);
+            }
+        }
+        return bs.toLongArray();
+    }
+
+    @Override
+    public StaticHasher getHasher() {
+        return new StaticHasher(iterator(), getShape());
+    }
+
+    /**
+     * Returns an iterator over the enabled indexes in this filter.
+     * Any index with a non-zero count is considered enabled.
+     * The iterator returns indexes in their natural order.
+     *
+     * @return an iterator over the enabled indexes
+     */
+    private PrimitiveIterator.OfInt iterator() {
+        return new IndexIterator();
+    }
+
+    @Override
+    public void merge(final BloomFilter other) {
+        applyAsBloomFilter(other, this::increment);
+    }
+
+    @Override
+    public void merge(final Hasher hasher) {
+        applyAsHasher(hasher, this::increment);
+    }
+
+    @Override
+    public boolean remove(BloomFilter other) {
+        applyAsBloomFilter(other, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean remove(Hasher hasher) {
+        applyAsHasher(hasher, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean add(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::add);
+        return isValid();
+    }
+
+    @Override
+    public boolean subtract(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::subtract);
+        return isValid();
+    }
+
+    /**
+     * {@inheritDoc}
+     *
+     * <p><em>Implementation note</em>
+     *
+     * <p>The state transition to invalid is permanent.
+     *
+     * <p>This implementation does not correct negative counts to zero or integer
+     * overflow counts to {@link Integer#MAX_VALUE}. Thus the operation that
+     * generated invalid counts can be reversed by using the complement of the
+     * original operation with the same Bloom filter. This will restore the counts
+     * to the state prior to the invalid operation. Counts can then be extracted
+     * using {@link #forEachCount(BitCountConsumer)}.
+     */
+    @Override
+    public boolean isValid() {
+        return state >= 0;
+    }
+
+    @Override
+    public void forEachCount(BitCountConsumer action) {
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                action.accept(i, counts[i]);
+            }
+        }
+    }
+
+    /**
+     * Apply the action for each index in the Bloom filter.
+     */
+    private void applyAsBloomFilter(final BloomFilter other, final IntConsumer action) {
+        verifyShape(other);
+        if (other instanceof ArrayCountingBloomFilter) {
+            // Only use the presence of non-zero and not the counts
+            final int[] counts2 = ((ArrayCountingBloomFilter) other).counts;
+            for (int i = 0; i < counts2.length; i++) {
+                if (counts2[i] != 0) {
+                    action.accept(i);
+                }
+            }
+        } else {
+            BitSet.valueOf(other.getBits()).stream().forEach(action);
+        }
+    }
+
+    /**
+     * Apply the action for each index in the hasher.
+     */
+    private void applyAsHasher(final Hasher hasher, final IntConsumer action) {
+        verifyHasher(hasher);
+        toSet(hasher).forEach(i -> action.accept(i));
 
 Review comment:
   It would work. Note you would need a +1 on the size of the long[] and you have to write the long bit back to the array:
   
   ```java
   // Ensure array is big enough
   filter = new long[BloomFilterIndexer.getLongIndex(shape.getNumberOfBits()) + 1];
   
   // ...
   
   // Write back the bit index
   if ((target & mask) == 0) {
       filter[BloomFilterIndexer.getLongIndex(arg0)] |= mask;
       child.accept(arg0);
   }
   ```
   
   But you are allocating an array of length 1/64 of the number of bits in the filter, whereas we expect a hasher to provide only k indexes, with k being small (e.g. < 30).
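   
   For reference, the bitmap approach discussed above could be sketched as a complete wrapper. This is an illustration only: the class name `BitMapUniqueFilter` is hypothetical, and plain shifts are used in place of the `BloomFilterIndexer` helper so that the sketch is self-contained.
   
   ```java
   import java.util.function.IntConsumer;
   
   /**
    * Hypothetical bitmap-backed wrapper that forwards each index to the
    * child consumer at most once. Not part of the pull request.
    */
   final class BitMapUniqueFilter implements IntConsumer {
       /** One bit per possible index; a set bit means the index was seen. */
       private final long[] seen;
       private final IntConsumer child;
   
       BitMapUniqueFilter(final int numberOfBits, final IntConsumer child) {
           // numberOfBits / 64 longs, plus one so the last partial word fits
           seen = new long[(numberOfBits >>> 6) + 1];
           this.child = child;
       }
   
       @Override
       public void accept(final int index) {
           final int word = index >>> 6;
           // A long shift uses only the low 6 bits of the shift distance
           final long mask = 1L << index;
           if ((seen[word] & mask) == 0) {
               // Write the bit back so a repeat of this index is ignored
               seen[word] |= mask;
               child.accept(index);
           }
       }
   }
   ```
   
   The memory cost is fixed by the number of bits in the shape, regardless of how few indexes the hasher actually produces.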
   
   Certainly there is a potential optimisation to be had by avoiding the use of a Set with boxed integers.
   
   It would be more space efficient to just use an int[] to store up to k indexes. Each new index is checked with a for loop over those already observed. The search for each index is O(n), so the overall complexity is O(n^2). In comparison, the search complexity of a TreeSet is O(log n), for O(n log n) overall. But with the boxing of the integers and the construction of the tree, it would require some performance analysis on typical small values of k to determine which is best.
   
   For now I can mark it as a TODO in the code to investigate using a different data structure to de-duplicate the indexes when the expected count of indexes is small/medium/large. I think your class would work for the large case. Here we require one for the small case.
   
   Since this may be useful it could be put into a factory to create an appropriate filter for different sizes:
   ```java
   IntConsumer action = ...
   int k = ...
   // UniqueFilter would ensure the action is called with unique values for
   // an expected total of k invocations.
   IntConsumer filteredAction = UniqueFilter.forSize(k, action);
   ```
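   
   As an illustration of the int[] approach described above for small k, a wrapper might look like the following. The class name `SmallUniqueFilter` is hypothetical and this is a sketch under the assumption that k is small enough for a linear scan to be acceptable; no performance claim is made.
   
   ```java
   import java.util.Arrays;
   import java.util.function.IntConsumer;
   
   /**
    * Hypothetical small-k wrapper that remembers observed indexes in an
    * int[] and uses a linear scan to detect repeats. Not part of the PR.
    */
   final class SmallUniqueFilter implements IntConsumer {
       private final IntConsumer child;
       /** Indexes observed so far, in arrival order. */
       private int[] seen;
       /** Number of valid entries in {@code seen}. */
       private int size;
   
       SmallUniqueFilter(final int expectedSize, final IntConsumer child) {
           seen = new int[Math.max(1, expectedSize)];
           this.child = child;
       }
   
       @Override
       public void accept(final int index) {
           // O(size) search with no boxing
           for (int i = 0; i < size; i++) {
               if (seen[i] == index) {
                   return;
               }
           }
           // Grow if more indexes arrive than were expected
           if (size == seen.length) {
               seen = Arrays.copyOf(seen, size * 2);
           }
           seen[size++] = index;
           child.accept(index);
       }
   }
   ```
   
   A factory such as the `UniqueFilter.forSize(k, action)` idea above could return a wrapper like this one when k is below some threshold that would have to be found by benchmarking.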
   


[GitHub] [commons-collections] coveralls commented on issue #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

coveralls commented on issue #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#issuecomment-595559382
 
 
   
   [![Coverage Status](https://coveralls.io/builds/29168070/badge)](https://coveralls.io/builds/29168070)
   
   Coverage increased (+0.03%) to 89.983% when pulling **4d41f3210c30568d5c86af470fcfee36bdf3a2ff on aherbert:counting-bloom** into **9831773447456466ae93b07e59ea76c8cce31787 on apache:master**.
   


[GitHub] [commons-collections] Claudenw commented on issue #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

Claudenw commented on issue #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#issuecomment-596121410
 
 
   All in all this looks like a good change to me.


[GitHub] [commons-collections] aherbert commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

aherbert commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389371416
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+    /**
+     * Apply the action for each index in the hasher.
+     */
+    private void applyAsHasher(final Hasher hasher, final IntConsumer action) {
+        verifyHasher(hasher);
+        toSet(hasher).forEach(i -> action.accept(i));
 
 Review comment:
   OK. I've refactored the extraction of unique indexes from a hasher to a new helper class. This can be optimised later when the Hasher interface is updated to support providing the following information:
   
   - size (number of indexes)
   - whether or not the indexes are distinct
   
   This could be satisfied by changing `Hasher.getBits(Shape)` to create a spliterator and not an iterator.
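   
   To illustrate how a spliterator could carry both pieces of information, here is a sketch using only the JDK. The class `IndexSpliteratorSketch` and its methods are hypothetical stand-ins for a hasher and for the merge step; the `Hasher` interface in this pull request is not changed by this example.
   
   ```java
   import java.util.BitSet;
   import java.util.Spliterator;
   import java.util.Spliterators;
   import java.util.function.IntConsumer;
   
   /** Hypothetical sketch; not the Hasher API. */
   final class IndexSpliteratorSketch {
   
       /** Stand-in for a hasher that knows its indexes are distinct. */
       static Spliterator.OfInt distinctIndexes(final int... indexes) {
           // Array spliterators always report SIZED; DISTINCT is added here
           return Spliterators.spliterator(indexes, Spliterator.DISTINCT);
       }
   
       /** Stand-in for a hasher that may repeat indexes. */
       static Spliterator.OfInt possiblyDuplicatedIndexes(final int... indexes) {
           return Spliterators.spliterator(indexes, 0);
       }
   
       /** Increments each index once, skipping de-duplication when possible. */
       static void mergeInto(final int[] counts, final Spliterator.OfInt indexes) {
           if (indexes.hasCharacteristics(Spliterator.DISTINCT)) {
               indexes.forEachRemaining((IntConsumer) i -> counts[i]++);
           } else {
               final BitSet seen = new BitSet(counts.length);
               indexes.forEachRemaining((IntConsumer) i -> {
                   if (!seen.get(i)) {
                       seen.set(i);
                       counts[i]++;
                   }
               });
           }
       }
   }
   ```
   
   The size is available from `getExactSizeIfKnown()`, which returns -1 when the spliterator is not SIZED, so a consumer could also use it to choose a de-duplication strategy.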
   


[GitHub] [commons-collections] Claudenw commented on issue #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

Claudenw commented on issue #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#issuecomment-599089499
 
 
   looks good to me.


[GitHub] [commons-collections] coveralls edited a comment on issue #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

coveralls edited a comment on issue #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#issuecomment-595559382
 
 
   
   [![Coverage Status](https://coveralls.io/builds/29353830/badge)](https://coveralls.io/builds/29353830)
   
   Coverage increased (+0.003%) to 89.996% when pulling **22d161a25b54065a78d0f82dc530bf36091a0c23 on aherbert:counting-bloom** into **90f705e73270ef0c2c21bd14daf998bbe0d5ecc3 on apache:master**.
   


[GitHub] [commons-collections] coveralls edited a comment on issue #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

coveralls edited a comment on issue #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#issuecomment-595559382
 
 
   
   [![Coverage Status](https://coveralls.io/builds/29205938/badge)](https://coveralls.io/builds/29205938)
   
   Coverage increased (+0.07%) to 90.017% when pulling **9eda4eedfa86762fcc70a8e5ced5d07d39329e8b on aherbert:counting-bloom** into **9831773447456466ae93b07e59ea76c8cce31787 on apache:master**.
   


[GitHub] [commons-collections] aherbert commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

aherbert commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389363342
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.NoSuchElementException;
+import java.util.PrimitiveIterator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.function.Consumer;
+import java.util.function.IntConsumer;
+import java.util.Set;
+
+import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
+import org.apache.commons.collections4.bloomfilter.hasher.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;
+
+/**
+ * A counting Bloom filter using an array to track counts for each enabled bit
+ * index.
+ *
+ * <p>Any operation that results in negative counts or integer overflow of counts will
+ * mark this filter as invalid. This transition is not reversible. The counts for the
+ * filter immediately prior to the operation that created invalid counts can be recovered.
+ * See the documentation in {@link #isValid()} for details.
+ *
+ * <p>All the operations in the filter assume the counts are currently valid. Behaviour
+ * of an invalid filter is undefined. It will no longer function identically to a standard
+ * Bloom filter that is the merge of all the Bloom filters that have been added
+ * to and not later subtracted from the counting Bloom filter.
+ *
+ * <p>The maximum supported number of items that can be stored in the filter is
+ * limited by the maximum array size combined with the {@link Shape}. For
+ * example an implementation using a {@link Shape} with a false-positive
+ * probability of 1e-6 and {@link Integer#MAX_VALUE} bits can reversibly store
+ * approximately 75 million items using 20 hash functions per item with a memory
+ * consumption of approximately 8 GB.
+ *
+ * @since 4.5
+ * @see Shape
+ */
+public class ArrayCountingBloomFilter extends AbstractBloomFilter implements CountingBloomFilter {
+
+    /**
+     * The count of each bit index in the filter.
+     */
+    private final int[] counts;
+
+    /**
+     * The state flag. This is a bitwise OR of the entire history of all updated
+     * counts. If negative then a negative count or integer overflow has occurred on
+     * one or more counts in the history of the filter and the state is invalid.
+     *
+     * <p>Maintenance of this state flag is branch-free for improved performance. It
+     * eliminates a conditional check for a negative count during remove/subtract
+     * operations and a conditional check for integer overflow during merge/add
+     * operations.
+     *
+     * <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
+     * that overflows indicates that the number of items in the filter exceeds the
+     * maximum possible size (number of bits) of any Bloom filter constrained by
+     * integer indices. At this point the filter is most likely full (all bits are
+     * non-zero) and thus useless.
+     *
+     * <p>Negative counts are a concern if the filter is used incorrectly by
+     * removing an item that was never added. It is expected that a user of a
+     * counting Bloom filter will not perform this action as it is a mistake.
+     * Enabling an explicit recovery path for negative or overflow counts is a major
+     * performance burden not deemed necessary for the unlikely scenarios when an
+     * invalid state is created. Maintenance of the state flag is a concession to
+     * flag improper use that should not have a major performance impact.
+     */
+    private int state;
+
+    /**
+     * An iterator of all indexes with non-zero counts.
+     *
+     * <p>In the event that the filter state is invalid any index with a negative count
+     * will also be produced by the iterator.
+     */
+    private class IndexIterator implements PrimitiveIterator.OfInt {
+        /** The next non-zero index (or counts.length). */
+        private int next;
+
+        /**
+         * Create an instance.
+         */
+        IndexIterator() {
+            advance();
+        }
+
+        /**
+         * Advance to the next non-zero index.
+         */
+        void advance() {
+            while (next < counts.length && counts[next] == 0) {
+                next++;
+            }
+        }
+
+        @Override
+        public boolean hasNext() {
+            return next < counts.length;
+        }
+
+        @Override
+        public int nextInt() {
+            if (hasNext()) {
+                final int result = next++;
+                advance();
+                return result;
+            }
+            // Currently unreachable as the iterator is only used by
+            // the StaticHasher which iterates correctly.
+            throw new NoSuchElementException();
+        }
+    }
+
+    /**
+     * Constructs an empty counting Bloom filter with the specified shape.
+     *
+     * @param shape the shape of the filter
+     */
+    public ArrayCountingBloomFilter(final Shape shape) {
+        super(shape);
+        counts = new int[shape.getNumberOfBits()];
+    }
+
+    /**
+     * Constructs a counting Bloom filter from a hasher and a shape.
+     *
+     * <p>The filter will be equal to the result of merging the hasher with an empty
+     * filter; specifically duplicate indexes in the hasher are ignored.
+     *
+     * @param hasher the hasher to build the filter from
+     * @param shape the shape of the filter
+     * @throws IllegalArgumentException if the hasher cannot generate indices for
+     * the shape
+     * @see #merge(Hasher)
+     */
+    public ArrayCountingBloomFilter(final Hasher hasher, final Shape shape) {
+        super(shape);
+        // Given the filter is empty we can optimise the operation of merge(hasher)
+        verifyHasher(hasher);
+        // Delay array allocation until after hasher is verified
+        counts = new int[shape.getNumberOfBits()];
+        // All counts are zero. Ignore duplicates by initialising to 1
+        hasher.getBits(shape).forEachRemaining((IntConsumer) idx -> counts[idx] = 1);
+    }
+
+    @Override
+    public int cardinality() {
+        int size = 0;
+        for (final int c : counts) {
+            if (c != 0) {
+                size++;
+            }
+        }
+        return size;
+    }
+
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
+    }
+
+    @Override
+    public boolean contains(final Hasher hasher) {
+        verifyHasher(hasher);
+        return contains(hasher.getBits(getShape()));
+    }
+
+    /**
+     * Returns true if this filter has non-zero counts for each index in the iterator.
+     *
+     * @param iter the iterator
+     * @return true if this filter contains all the indexes
+     */
+    private boolean contains(final OfInt iter) {
+        while (iter.hasNext()) {
+            if (counts[iter.nextInt()] == 0) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    @Override
+    public long[] getBits() {
+        final BitSet bs = new BitSet();
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                bs.set(i);
+            }
+        }
+        return bs.toLongArray();
+    }
+
+    @Override
+    public StaticHasher getHasher() {
+        return new StaticHasher(iterator(), getShape());
+    }
+
+    /**
+     * Returns an iterator over the enabled indexes in this filter.
+     * Any index with a non-zero count is considered enabled.
+     * The iterator returns indexes in their natural order.
+     *
+     * @return an iterator over the enabled indexes
+     */
+    private PrimitiveIterator.OfInt iterator() {
+        return new IndexIterator();
+    }
+
+    @Override
+    public void merge(final BloomFilter other) {
+        applyAsBloomFilter(other, this::increment);
+    }
+
+    @Override
+    public void merge(final Hasher hasher) {
+        applyAsHasher(hasher, this::increment);
+    }
+
+    @Override
+    public boolean remove(BloomFilter other) {
+        applyAsBloomFilter(other, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean remove(Hasher hasher) {
+        applyAsHasher(hasher, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean add(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::add);
+        return isValid();
+    }
+
+    @Override
+    public boolean subtract(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::subtract);
+        return isValid();
+    }
+
+    /**
+     * {@inheritDoc}
+     *
+     * <p><em>Implementation note</em>
+     *
+     * <p>The state transition to invalid is permanent.
+     *
+     * <p>This implementation does not correct negative counts to zero or integer
+     * overflow counts to {@link Integer#MAX_VALUE}. Thus the operation that
+     * generated invalid counts can be reversed by using the complement of the
+     * original operation with the same Bloom filter. This will restore the counts
+     * to the state prior to the invalid operation. Counts can then be extracted
+     * using {@link #forEachCount(BitCountConsumer)}.
+     */
+    @Override
+    public boolean isValid() {
+        return state >= 0;
+    }
+
+    @Override
+    public void forEachCount(BitCountConsumer action) {
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                action.accept(i, counts[i]);
+            }
+        }
+    }
+
+    /**
+     * Apply the action for each index in the Bloom filter.
+     */
+    private void applyAsBloomFilter(final BloomFilter other, final IntConsumer action) {
+        verifyShape(other);
+        if (other instanceof ArrayCountingBloomFilter) {
+            // Only use the presence of non-zero and not the counts
+            final int[] counts2 = ((ArrayCountingBloomFilter) other).counts;
+            for (int i = 0; i < counts2.length; i++) {
+                if (counts2[i] != 0) {
+                    action.accept(i);
+                }
+            }
+        } else {
+            BitSet.valueOf(other.getBits()).stream().forEach(action);
+        }
+    }
+
+    /**
+     * Apply the action for each index in the hasher.
+     */
+    private void applyAsHasher(final Hasher hasher, final IntConsumer action) {
+        verifyHasher(hasher);
+        toSet(hasher).forEach(i -> action.accept(i));
 
 Review comment:
   It would work, but that is for the case where you are merging possibly duplicate indices for a Bloom filter with a known shape that is filled to the expected number of items in the shape. So you are estimating for a different use case, one where the Hasher represents a Bloom filter at capacity.
   
   The use case here is the merge of a single item that has been hashed into the filter. In this case, IIUC, the Hasher should return only `shape.getNumberOfHashFunctions()` indexes in the iterator.
   
   However, there is no mandate to do this and the Hasher could return many more indexes.
   
   If we want to filter possibly duplicated indexes from a Hasher in an optimal way, it seems that the Hasher interface should have an `estimateSize()` function.
   
   There are 2 cases:
   - size is known exactly
   - size is an estimate
   
   This is dealt with in the Spliterator interface using the methods:
   
   ```java
   long estimateSize();
   long getExactSizeIfKnown();
   ```
   
   So we could work with this idea:
   
   1. We change the Hasher to return a spliterator and use the JDK framework. For example, a StaticHasher can report that it has a known size, whereas a DynamicHasher may not know its size.
   
   2. We add `estimateSize` and `getExactSizeIfKnown` to the Hasher interface.
   
   I think option 1 adds less clutter to the API. The interface can then have a default method to create an iterator from the spliterator if you want to process items one at a time and allow fail-fast behaviour.
   
   Thus a `BloomFilter.merge(Hasher)` operation can obtain the spliterator, get the size and then create a filter that makes the indexes unique using an optimal data structure. Alternatively, it can skip filtering the indexes if the Hasher reports that its Spliterator is `Spliterator.DISTINCT`.
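   
   To make the idea concrete, here is a minimal sketch of option 1 under stated assumptions. `IndexSource` is a hypothetical stand-in for a Hasher that returns a spliterator, and the `SMALL` cut-off and the choice between a `HashSet` and a `BitSet` are arbitrary illustrations, not a proposal for the actual API.
   
   ```java
   import java.util.BitSet;
   import java.util.HashSet;
   import java.util.PrimitiveIterator;
   import java.util.Set;
   import java.util.Spliterator;
   import java.util.Spliterators;
   import java.util.function.IntConsumer;
   
   /** Hypothetical stand-in for a Hasher that returns a spliterator. */
   interface IndexSource {
       /** Indexes for a filter; may or may not report SIZED or DISTINCT. */
       Spliterator.OfInt indexes();
   
       /** Default one-at-a-time view, e.g. for fail-fast contains checks. */
       default PrimitiveIterator.OfInt iterator() {
           return Spliterators.iterator(indexes());
       }
   }
   
   final class SpliteratorMergeSketch {
       /** Arbitrary cut-off chosen for this sketch only. */
       private static final int SMALL = 64;
   
       /** Passes each distinct index to the action exactly once. */
       static void forEachUnique(final IndexSource source, final int numberOfBits,
                                 final IntConsumer action) {
           final Spliterator.OfInt split = source.indexes();
           if (split.hasCharacteristics(Spliterator.DISTINCT)) {
               // The source promises uniqueness: no filtering needed.
               split.forEachRemaining(action);
               return;
           }
           final long size = split.getExactSizeIfKnown();
           if (size >= 0 && size <= SMALL) {
               // Few indexes: a small hash set avoids allocating numberOfBits bits.
               final Set<Integer> seen = new HashSet<>();
               split.forEachRemaining((IntConsumer) i -> {
                   if (seen.add(i)) {
                       action.accept(i);
                   }
               });
           } else {
               // Unknown or large size: track seen indexes in a bit set.
               final BitSet seen = new BitSet(numberOfBits);
               split.forEachRemaining((IntConsumer) i -> {
                   if (!seen.get(i)) {
                       seen.set(i);
                       action.accept(i);
                   }
               });
           }
       }
   }
   ```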
   
   Changing the Hasher API is for another PR. For now I will abstract out a UniqueFilter concept that accepts a Hasher and an IntConsumer as arguments and ensures that only unique values are passed to the consumer. This can be used when you want to merge with all the unique bit indexes in a Hasher.
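   
   A minimal sketch of that concept, assuming the indexes arrive as a `PrimitiveIterator.OfInt` (as `hasher.getBits(shape)` does today). The class name `UniqueFilterSketch` and the `BitSet`-based tracking are illustrative assumptions and may differ from the helper that ends up in the PR.
   
   ```java
   import java.util.Arrays;
   import java.util.BitSet;
   import java.util.PrimitiveIterator;
   import java.util.function.IntConsumer;
   
   /** Hypothetical sketch; not the helper class added in this PR. */
   final class UniqueFilterSketch {
   
       /** Passes each index to the consumer only the first time it is seen. */
       static void forEachUnique(final PrimitiveIterator.OfInt indexes, final int numberOfBits,
                                 final IntConsumer consumer) {
           final BitSet seen = new BitSet(numberOfBits);
           while (indexes.hasNext()) {
               final int index = indexes.nextInt();
               if (!seen.get(index)) {
                   seen.set(index);
                   consumer.accept(index);
               }
           }
       }
   
       public static void main(final String[] args) {
           final int[] counts = new int[8];
           // Indexes 1 and 3 are duplicated, as a hasher is allowed to do.
           final PrimitiveIterator.OfInt indexes =
               Arrays.stream(new int[] {1, 3, 3, 6, 1}).iterator();
           forEachUnique(indexes, counts.length, i -> counts[i]++);
           // Each enabled index is incremented once: [0, 1, 0, 1, 0, 0, 1, 0]
           System.out.println(Arrays.toString(counts));
       }
   }
   ```
   
   Without the filtering step the duplicated indexes would be counted twice, so a later remove of the same item could not restore the counts to zero.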
   


[GitHub] [commons-collections] Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389364684
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+    /**
+     * Apply the action for each index in the hasher.
+     */
+    private void applyAsHasher(final Hasher hasher, final IntConsumer action) {
+        verifyHasher(hasher);
+        toSet(hasher).forEach(i -> action.accept(i));
 
 Review comment:
   The Hasher may return more than `shape.getNumberOfHashFunctions()` indexes, as a Hasher may contain more than one item.


[GitHub] [commons-collections] Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389358339
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.NoSuchElementException;
+import java.util.PrimitiveIterator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.function.Consumer;
+import java.util.function.IntConsumer;
+import java.util.Set;
+
+import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
+import org.apache.commons.collections4.bloomfilter.hasher.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;
+
+/**
+ * A counting Bloom filter using an array to track counts for each enabled bit
+ * index.
+ *
+ * <p>Any operation that results in negative counts or integer overflow of counts will
+ * mark this filter as invalid. This transition is not reversible. The counts for the
+ * filter immediately prior to the operation that created invalid counts can be recovered.
+ * See the documentation in {@link #isValid()} for details.
+ *
+ * <p>All the operations in the filter assume the counts are currently valid. Behaviour
+ * of an invalid filter is undefined. It will no longer function identically to a standard
+ * Bloom filter that is the merge of all the Bloom filters that have been added
+ * to and not later subtracted from the counting Bloom filter.
+ *
+ * <p>The maximum supported number of items that can be stored in the filter is
+ * limited by the maximum array size combined with the {@link Shape}. For
+ * example an implementation using a {@link Shape} with a false-positive
+ * probability of 1e-6 and {@link Integer#MAX_VALUE} bits can reversibly store
+ * approximately 75 million items using 20 hash functions per item with a memory
+ * consumption of approximately 8 GB.
+ *
+ * @since 4.5
+ * @see Shape
+ */
+public class ArrayCountingBloomFilter extends AbstractBloomFilter implements CountingBloomFilter {
+
+    /**
+     * The count of each bit index in the filter.
+     */
+    private final int[] counts;
+
+    /**
+     * The state flag. This is a bitwise OR of the entire history of all updated
+     * counts. If negative then a negative count or integer overflow has occurred on
+     * one or more counts in the history of the filter and the state is invalid.
+     *
+     * <p>Maintenance of this state flag is branch-free for improved performance. It
+     * eliminates a conditional check for a negative count during remove/subtract
+     * operations and a conditional check for integer overflow during merge/add
+     * operations.
+     *
+     * <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
+     * that overflows indicates that the number of items in the filter exceeds the
+     * maximum possible size (number of bits) of any Bloom filter constrained by
+     * integer indices. At this point the filter is most likely full (all bits are
+     * non-zero) and thus useless.
+     *
+     * <p>Negative counts are a concern if the filter is used incorrectly by
+     * removing an item that was never added. It is expected that a user of a
+     * counting Bloom filter will not perform this action as it is a mistake.
+     * Enabling an explicit recovery path for negative or overflow counts is a major
+     * performance burden not deemed necessary for the unlikely scenarios when an
+     * invalid state is created. Maintenance of the state flag is a concession to
+     * flag improper use that should not have a major performance impact.
+     */
+    private int state;
+
+    /**
+     * An iterator of all indexes with non-zero counts.
+     *
+     * <p>In the event that the filter state is invalid any index with a negative count
+     * will also be produced by the iterator.
+     */
+    private class IndexIterator implements PrimitiveIterator.OfInt {
+        /** The next non-zero index (or counts.length). */
+        private int next;
+
+        /**
+         * Create an instance.
+         */
+        IndexIterator() {
+            advance();
+        }
+
+        /**
+         * Advance to the next non-zero index.
+         */
+        void advance() {
+            while (next < counts.length && counts[next] == 0) {
+                next++;
+            }
+        }
+
+        @Override
+        public boolean hasNext() {
+            return next < counts.length;
+        }
+
+        @Override
+        public int nextInt() {
+            if (hasNext()) {
+                final int result = next++;
+                advance();
+                return result;
+            }
+            // Currently unreachable as the iterator is only used by
+            // the StaticHasher which iterates correctly.
+            throw new NoSuchElementException();
+        }
+    }
+
+    /**
+     * Constructs an empty counting Bloom filter with the specified shape.
+     *
+     * @param shape the shape of the filter
+     */
+    public ArrayCountingBloomFilter(final Shape shape) {
+        super(shape);
+        counts = new int[shape.getNumberOfBits()];
+    }
+
+    /**
+     * Constructs a counting Bloom filter from a hasher and a shape.
+     *
+     * <p>The filter will be equal to the result of merging the hasher with an empty
+     * filter; specifically duplicate indexes in the hasher are ignored.
+     *
+     * @param hasher the hasher to build the filter from
+     * @param shape the shape of the filter
+     * @throws IllegalArgumentException if the hasher cannot generate indices for
+     * the shape
+     * @see #merge(Hasher)
+     */
+    public ArrayCountingBloomFilter(final Hasher hasher, final Shape shape) {
+        super(shape);
+        // Given the filter is empty we can optimise the operation of merge(hasher)
+        verifyHasher(hasher);
+        // Delay array allocation until after hasher is verified
+        counts = new int[shape.getNumberOfBits()];
+        // All counts are zero. Ignore duplicates by initialising to 1
+        hasher.getBits(shape).forEachRemaining((IntConsumer) idx -> counts[idx] = 1);
+    }
+
+    @Override
+    public int cardinality() {
+        int size = 0;
+        for (final int c : counts) {
+            if (c != 0) {
+                size++;
+            }
+        }
+        return size;
+    }
+
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
+    }
+
+    @Override
+    public boolean contains(final Hasher hasher) {
+        verifyHasher(hasher);
+        return contains(hasher.getBits(getShape()));
+    }
+
+    /**
+     * Returns true if this filter has non-zero counts for each index in the iterator.
+     *
+     * @param iter the iterator
+     * @return true if this filter contains all the indexes
+     */
+    private boolean contains(final OfInt iter) {
+        while (iter.hasNext()) {
+            if (counts[iter.nextInt()] == 0) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    @Override
+    public long[] getBits() {
+        final BitSet bs = new BitSet();
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                bs.set(i);
+            }
+        }
+        return bs.toLongArray();
+    }
+
+    @Override
+    public StaticHasher getHasher() {
+        return new StaticHasher(iterator(), getShape());
+    }
+
+    /**
+     * Returns an iterator over the enabled indexes in this filter.
+     * Any index with a non-zero count is considered enabled.
+     * The iterator returns indexes in their natural order.
+     *
+     * @return an iterator over the enabled indexes
+     */
+    private PrimitiveIterator.OfInt iterator() {
+        return new IndexIterator();
+    }
+
+    @Override
+    public void merge(final BloomFilter other) {
+        applyAsBloomFilter(other, this::increment);
+    }
+
+    @Override
+    public void merge(final Hasher hasher) {
+        applyAsHasher(hasher, this::increment);
+    }
+
+    @Override
+    public boolean remove(BloomFilter other) {
+        applyAsBloomFilter(other, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean remove(Hasher hasher) {
+        applyAsHasher(hasher, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean add(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::add);
+        return isValid();
+    }
+
+    @Override
+    public boolean subtract(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::subtract);
+        return isValid();
+    }
+
+    /**
+     * {@inheritDoc}
+     *
+     * <p><em>Implementation note</em>
+     *
+     * <p>The state transition to invalid is permanent.
+     *
+     * <p>This implementation does not correct negative counts to zero or integer
+     * overflow counts to {@link Integer#MAX_VALUE}. Thus the operation that
+     * generated invalid counts can be reversed by using the complement of the
+     * original operation with the same Bloom filter. This will restore the counts
+     * to the state prior to the invalid operation. Counts can then be extracted
+     * using {@link #forEachCount(BitCountConsumer)}.
+     */
+    @Override
+    public boolean isValid() {
+        return state >= 0;
+    }
+
+    @Override
+    public void forEachCount(BitCountConsumer action) {
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                action.accept(i, counts[i]);
+            }
+        }
+    }
+
+    /**
+     * Apply the action for each index in the Bloom filter.
+     */
+    private void applyAsBloomFilter(final BloomFilter other, final IntConsumer action) {
+        verifyShape(other);
+        if (other instanceof ArrayCountingBloomFilter) {
+            // Only use the presence of non-zero and not the counts
+            final int[] counts2 = ((ArrayCountingBloomFilter) other).counts;
+            for (int i = 0; i < counts2.length; i++) {
+                if (counts2[i] != 0) {
+                    action.accept(i);
+                }
+            }
+        } else {
+            BitSet.valueOf(other.getBits()).stream().forEach(action);
+        }
+    }
+
+    /**
+     * Apply the action for each index in the hasher.
+     */
+    private void applyAsHasher(final Hasher hasher, final IntConsumer action) {
+        verifyHasher(hasher);
+        toSet(hasher).forEach(i -> action.accept(i));
 
 Review comment:
   How about this as an approach:
   
   ```java
   import java.util.HashSet;
   import java.util.Set;
   import java.util.function.IntConsumer;
   import java.util.function.IntPredicate;
   
   import org.apache.commons.collections4.bloomfilter.hasher.Shape;
   
   public class UniqueFilter implements IntConsumer {
       final IntConsumer child;
       final IntPredicate predicate;
   
       /**
        * Constructs an IntConsumer that only passes unique values. 
        * @param child the IntConsumer to receive the values.
        * @param shape the shape of the Bloom filter associated with the values.
        */
       public UniqueFilter( IntConsumer child, Shape shape) {
           this.child=child;
           /* 
            * The type of filter is determined by the expected number of bits
            * set as a proportion of the total number of bits. The assumption
            * is that lower saturations will function more efficiently using
            * a standard set rather than a collection of bits for each possible value.
            */
           double median = calculateMedianHamming( shape );
           if ( (median / shape.getNumberOfBits()) > 0.25)
           {
               predicate = new LongFilter( shape );
           } else {
               predicate = new SetFilter( shape );
           }
       }
       
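       /**
        * Estimates the expected number of bits set (the Hamming value) once
        * n items, each hashed k times, have been merged into m bits.
        */
       private double calculateMedianHamming(Shape shape) {
           int k = shape.getNumberOfHashFunctions();
           int n = shape.getNumberOfItems();
           // need a double here for the calculations to work.
           double m = shape.getNumberOfBits();

           // Use a double product to avoid int overflow for large shapes.
           double kn = (double) k * n;

           // Expected distinct bits: m * (1 - ((m - 1) / m)^kn).
           // Note the parentheses: (m - 1) / m, not m - (1 / m).
           return m - m * Math.pow((m - 1) / m, kn);
       }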
       
       private double estimateMedianHamming(Shape shape) {
           int kn = shape.getNumberOfHashFunctions()*shape.getNumberOfItems();
           int limit = Integer.min(kn, shape.getNumberOfBits());
           int range = limit - shape.getNumberOfHashFunctions();
           int estimate = shape.getNumberOfHashFunctions() + (range/2);
           return estimate;
       }
       
       @Override
       public void accept(int arg0) {
           if (predicate.test( arg0 ))
           {
               child.accept( arg0 );
           }
       }
       
       
       public static class LongFilter implements IntPredicate {
           final long[] filter;
   
           LongFilter(Shape shape ) {
               filter = new long[ BloomFilterIndexer.getLongIndex(shape.getNumberOfBits())+1 ];
           }
           
           @Override
           public boolean test(int arg0) {
               int idx = BloomFilterIndexer.getLongIndex(arg0);
               long mask = BloomFilterIndexer.getLongBit(arg0);
               if ( (filter[ idx ] & mask) == 0)
               {
                   filter[ idx  ] |= mask;
                   return true;
               }
               return false;
               
           }
       }
       
       public static class SetFilter implements IntPredicate {
           final Set<Integer> filter;
   
           SetFilter( Shape shape ) {
               filter = new HashSet<Integer>();
           }
           
           @Override
           public boolean test(int arg0) {
               return filter.add(arg0);
           }
       }
   }
   ```
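
The idea of filtering duplicate indexes before they reach the counting operation can be illustrated with a minimal, self-contained sketch. It is not the proposed `UniqueFilter`: it uses only `java.util.BitSet`, and the names `DedupSketch`, `expectedBitsSet` and `uniqueOnly` are hypothetical. It assumes indexes are uniformly distributed when estimating the expected number of bits set.

```java
import java.util.BitSet;
import java.util.function.IntConsumer;

final class DedupSketch {

    /**
     * Expected number of distinct bits set after k * n uniform draws into m bits:
     * m * (1 - ((m - 1) / m)^(k * n)).
     */
    static double expectedBitsSet(int k, int n, int m) {
        double draws = (double) k * n;
        return m * (1.0 - Math.pow((m - 1.0) / m, draws));
    }

    /** Wraps a consumer so that each index is forwarded to it at most once. */
    static IntConsumer uniqueOnly(IntConsumer child, int numberOfBits) {
        BitSet seen = new BitSet(numberOfBits);
        return idx -> {
            if (!seen.get(idx)) {
                seen.set(idx);
                child.accept(idx);
            }
        };
    }

    public static void main(String[] args) {
        int[] counts = new int[16];
        IntConsumer increment = uniqueOnly(i -> counts[i]++, counts.length);
        // Index 3 and index 7 are repeated but must only be counted once.
        for (int idx : new int[] {3, 7, 3, 3, 7, 11}) {
            increment.accept(idx);
        }
        System.out.println(counts[3] + " " + counts[7] + " " + counts[11]);
        System.out.println(expectedBitsSet(20, 1000, 1 << 16));
    }
}
```

A single `BitSet` costs one bit per possible index regardless of saturation, which is the trade-off the proposal addresses by switching to a `Set` for sparsely populated shapes.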


[GitHub] [commons-collections] asfgit merged pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

asfgit merged pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137
 
 
   


[GitHub] [commons-collections] aherbert commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

aherbert commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389319599
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
 
 Review comment:
   This comment is a prelude to the discussion from the mailing list where the getHasher() method is dropped from the BloomFilter in favour of an iterator of the indices. 
   
   Your example requires a function to convert a BloomFilter to an iterator. Who provides that function? The data lives inside the BloomFilter, so encapsulation requires the BloomFilter itself to provide it. So let's just go with the plan to have a BloomFilter provide an iterator. This would replace the getHasher method and remove two problems: the requirement to use the concrete StaticHasher class as the return type, and the inefficiency of doing so for a filter that is close to capacity (the StaticHasher constructor currently passes the iterator through a Set to make the indexes unique).
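
As a rough illustration of what "a BloomFilter provides an iterator" could look like for a filter backed by `long[]` bits, here is a minimal sketch. It is an assumption for illustration only, not the API that was adopted: `BitIndexIterator` and `containsAll` are hypothetical names, and the code uses only the JDK. It walks the enabled bit indexes directly, with no intermediate `Set` or `StaticHasher`.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

final class BitIndexIterator implements PrimitiveIterator.OfInt {
    private final long[] bits;
    /** Index of the long currently being scanned. */
    private int word;
    /** Bits of the current long that have not been returned yet. */
    private long remaining;

    BitIndexIterator(long[] bits) {
        this.bits = bits;
        this.remaining = bits.length == 0 ? 0 : bits[0];
        advance();
    }

    /** Moves to the next long that still has a bit set, if any. */
    private void advance() {
        while (remaining == 0 && word < bits.length - 1) {
            remaining = bits[++word];
        }
    }

    @Override
    public boolean hasNext() {
        return remaining != 0;
    }

    @Override
    public int nextInt() {
        if (remaining == 0) {
            throw new NoSuchElementException();
        }
        int bit = Long.numberOfTrailingZeros(remaining);
        remaining &= remaining - 1; // clear the lowest set bit
        int result = word * Long.SIZE + bit;
        advance();
        return result;
    }

    /** Fails fast on the first index whose count is zero. */
    static boolean containsAll(int[] counts, PrimitiveIterator.OfInt indexes) {
        while (indexes.hasNext()) {
            if (counts[indexes.nextInt()] == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] counts = new int[192];
        counts[0] = 2;
        counts[3] = 1;
        counts[128] = 5;
        long[] other = {0b1001L, 0L, 1L}; // bits 0, 3 and 128 enabled
        System.out.println(containsAll(counts, new BitIndexIterator(other)));
    }
}
```

Because the indexes come straight from the bit representation they are unique and in natural order, so a counting filter could consume them without a de-duplication step.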
   


[GitHub] [commons-collections] Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389299774
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
 
 Review comment:
   Would this work for the BloomFilter Hasher implementation?
   
   ```java
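       import java.util.PrimitiveIterator;
       import java.util.function.Function;

       // BloomFilter, Hasher, Shape and HashFunctionIdentity are the
       // commons-collections types and are assumed to be in scope.
       public class BloomFilterHasher implements Hasher {
           /** The filter whose enabled indexes are exposed. */
           private final BloomFilter bf;
           /** Converts the filter to an iterator of its enabled indexes. */
           private final Function<BloomFilter, PrimitiveIterator.OfInt> func;

           BloomFilterHasher(BloomFilter bf, Function<BloomFilter, PrimitiveIterator.OfInt> func) {
               this.bf = bf;
               this.func = func;
           }

           @Override
           public PrimitiveIterator.OfInt getBits(Shape shape) {
               // The indexes are only meaningful for the shape of the wrapped filter.
               if (!bf.getShape().equals(shape)) {
                   throw new IllegalArgumentException(String.format("Hasher shape (%s) is not the same as shape (%s)",
                       bf.getShape(), shape));
               }
               return func.apply(bf);
           }

           @Override
           public HashFunctionIdentity getHashFunctionIdentity() {
               return bf.getShape().getHashFunctionIdentity();
           }

           @Override
           public boolean isEmpty() {
               return bf.cardinality() == 0;
           }
       }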
   ```
   
   `BloomFilter.getHasher()` would have to be changed to return a `Hasher` (rather than `StaticHasher`) or perhaps a `UniqueHasher` where `UniqueHasher extends Hasher` and is only a marker to indicate that the values do not contain duplicates.
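   
   As a rough illustration of the marker idea, here is a minimal self-contained sketch. It does not use the real `Hasher` or `Shape` types: `IndexSource`, `UniqueIndexSource` and `MarkerDemo` are hypothetical stand-ins invented for this example. The point is only that a consumer can test for the marker and skip its de-duplication pass when the source promises unique indexes.
   
   ```java
   import java.util.Arrays;
   import java.util.PrimitiveIterator;
   import java.util.function.IntConsumer;

   /** Hypothetical stand-in for Hasher: a source of bit indexes. */
   interface IndexSource {
       PrimitiveIterator.OfInt indexes();
   }

   /** Marker only: implementations promise that indexes() yields no duplicates. */
   interface UniqueIndexSource extends IndexSource {
   }

   final class MarkerDemo {
       /** Increments each count once per distinct index produced by the source. */
       static void merge(int[] counts, IndexSource source) {
           final PrimitiveIterator.OfInt it = source.indexes();
           if (source instanceof UniqueIndexSource) {
               // The marker guarantees uniqueness, so no tracking is required.
               it.forEachRemaining((IntConsumer) i -> counts[i]++);
               return;
           }
           // Unmarked source: remember which indexes have been seen already.
           final boolean[] seen = new boolean[counts.length];
           it.forEachRemaining((IntConsumer) i -> {
               if (!seen[i]) {
                   seen[i] = true;
                   counts[i]++;
               }
           });
       }

       /** Merges one source with duplicates and one marked as unique. */
       static int[] demo() {
           final int[] counts = new int[4];
           final IndexSource withDuplicates = () -> Arrays.stream(new int[] {1, 1, 3}).iterator();
           final UniqueIndexSource unique = () -> Arrays.stream(new int[] {1, 3}).iterator();
           merge(counts, withDuplicates);
           merge(counts, unique);
           return counts;
       }
   }
   ```
   
   In this sketch `MarkerDemo.demo()` increments indexes 1 and 3 exactly once per merge, whichever path is taken.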


[GitHub] [commons-collections] Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter

Posted by GitBox <gi...@apache.org>.

Claudenw commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389283827
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.NoSuchElementException;
+import java.util.PrimitiveIterator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.function.Consumer;
+import java.util.function.IntConsumer;
+import java.util.Set;
+
+import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
+import org.apache.commons.collections4.bloomfilter.hasher.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;
+
+/**
+ * A counting Bloom filter using an array to track counts for each enabled bit
+ * index.
+ *
+ * <p>Any operation that results in negative counts or integer overflow of counts will
+ * mark this filter as invalid. This transition is not reversible. The counts for the
+ * filter immediately prior to the operation that created invalid counts can be recovered.
+ * See the documentation in {@link #isValid()} for details.
+ *
+ * <p>All the operations in the filter assume the counts are currently valid. Behaviour
+ * of an invalid filter is undefined. It will no longer function identically to a standard
+ * Bloom filter that is the merge of all the Bloom filters that have been added
+ * to and not later subtracted from the counting Bloom filter.
+ *
+ * <p>The maximum supported number of items that can be stored in the filter is
+ * limited by the maximum array size combined with the {@link Shape}. For
+ * example an implementation using a {@link Shape} with a false-positive
+ * probability of 1e-6 and {@link Integer#MAX_VALUE} bits can reversibly store
+ * approximately 75 million items using 20 hash functions per item with a memory
+ * consumption of approximately 8 GB.
+ *
+ * @since 4.5
+ * @see Shape
+ */
+public class ArrayCountingBloomFilter extends AbstractBloomFilter implements CountingBloomFilter {
+
+    /**
+     * The count of each bit index in the filter.
+     */
+    private final int[] counts;
+
+    /**
+     * The state flag. This is a bitwise OR of the entire history of all updated
+     * counts. If negative then a negative count or integer overflow has occurred on
+     * one or more counts in the history of the filter and the state is invalid.
+     *
+     * <p>Maintenance of this state flag is branch-free for improved performance. It
+     * eliminates a conditional check for a negative count during remove/subtract
+     * operations and a conditional check for integer overflow during merge/add
+     * operations.
+     *
+     * <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
+     * that overflows indicates that the number of items in the filter exceeds the
+     * maximum possible size (number of bits) of any Bloom filter constrained by
+     * integer indices. At this point the filter is most likely full (all bits are
+     * non-zero) and thus useless.
+     *
+     * <p>Negative counts are a concern if the filter is used incorrectly by
+     * removing an item that was never added. It is expected that a user of a
+     * counting Bloom filter will not perform this action as it is a mistake.
+     * Enabling an explicit recovery path for negative or overflow counts is a major
+     * performance burden not deemed necessary for the unlikely scenarios when an
+     * invalid state is created. Maintenance of the state flag is a concession to
+     * flag improper use that should not have a major performance impact.
+     */
+    private int state;
+
+    /**
+     * An iterator of all indexes with non-zero counts.
+     *
+     * <p>In the event that the filter state is invalid any index with a negative count
+     * will also be produced by the iterator.
+     */
+    private class IndexIterator implements PrimitiveIterator.OfInt {
+        /** The next non-zero index (or counts.length). */
+        private int next;
+
+        /**
+         * Create an instance.
+         */
+        IndexIterator() {
+            advance();
+        }
+
+        /**
+         * Advance to the next non-zero index.
+         */
+        void advance() {
+            while (next < counts.length && counts[next] == 0) {
+                next++;
+            }
+        }
+
+        @Override
+        public boolean hasNext() {
+            return next < counts.length;
+        }
+
+        @Override
+        public int nextInt() {
+            if (hasNext()) {
+                final int result = next++;
+                advance();
+                return result;
+            }
+            // Currently unreachable as the iterator is only used by
+            // the StaticHasher which iterates correctly.
+            throw new NoSuchElementException();
+        }
+    }
+
+    /**
+     * Constructs an empty counting Bloom filter with the specified shape.
+     *
+     * @param shape the shape of the filter
+     */
+    public ArrayCountingBloomFilter(final Shape shape) {
+        super(shape);
+        counts = new int[shape.getNumberOfBits()];
+    }
+
+    /**
+     * Constructs a counting Bloom filter from a hasher and a shape.
+     *
+     * <p>The filter will be equal to the result of merging the hasher with an empty
+     * filter; specifically duplicate indexes in the hasher are ignored.
+     *
+     * @param hasher the hasher to build the filter from
+     * @param shape the shape of the filter
+     * @throws IllegalArgumentException if the hasher cannot generate indices for
+     * the shape
+     * @see #merge(Hasher)
+     */
+    public ArrayCountingBloomFilter(final Hasher hasher, final Shape shape) {
+        super(shape);
+        // Given the filter is empty we can optimise the operation of merge(hasher)
+        verifyHasher(hasher);
+        // Delay array allocation until after hasher is verified
+        counts = new int[shape.getNumberOfBits()];
+        // All counts are zero. Ignore duplicates by initialising to 1
+        hasher.getBits(shape).forEachRemaining((IntConsumer) idx -> counts[idx] = 1);
+    }
+
+    @Override
+    public int cardinality() {
+        int size = 0;
+        for (final int c : counts) {
+            if (c != 0) {
+                size++;
+            }
+        }
+        return size;
+    }
+
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
+    }
+
+    @Override
+    public boolean contains(final Hasher hasher) {
+        verifyHasher(hasher);
+        return contains(hasher.getBits(getShape()));
+    }
+
+    /**
+     * Return true if this filter has non-zero counts for each index in the iterator.
+     *
+     * @param iter the iterator
+     * @return true if this filter contains all the indexes
+     */
+    private boolean contains(final OfInt iter) {
+        while (iter.hasNext()) {
+            if (counts[iter.nextInt()] == 0) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    @Override
+    public long[] getBits() {
+        final BitSet bs = new BitSet();
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                bs.set(i);
+            }
+        }
+        return bs.toLongArray();
+    }
+
+    @Override
+    public StaticHasher getHasher() {
+        return new StaticHasher(iterator(), getShape());
+    }
+
+    /**
+     * Returns an iterator over the enabled indexes in this filter.
+     * Any index with a non-zero count is considered enabled.
+     * The iterator returns indexes in their natural order.
+     *
+     * @return an iterator over the enabled indexes
+     */
+    private PrimitiveIterator.OfInt iterator() {
+        return new IndexIterator();
+    }
+
+    @Override
+    public void merge(final BloomFilter other) {
+        applyAsBloomFilter(other, this::increment);
+    }
+
+    @Override
+    public void merge(final Hasher hasher) {
+        applyAsHasher(hasher, this::increment);
+    }
+
+    @Override
+    public boolean remove(BloomFilter other) {
+        applyAsBloomFilter(other, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean remove(Hasher hasher) {
+        applyAsHasher(hasher, this::decrement);
+        return isValid();
+    }
+
+    @Override
+    public boolean add(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::add);
+        return isValid();
+    }
+
+    @Override
+    public boolean subtract(CountingBloomFilter other) {
+        applyAsCountingBloomFilter(other, this::subtract);
+        return isValid();
+    }
+
+    /**
+     * {@inheritDoc}
+     *
+     * <p><em>Implementation note</em>
+     *
+     * <p>The state transition to invalid is permanent.
+     *
+     * <p>This implementation does not correct negative counts to zero or integer
+     * overflow counts to {@link Integer#MAX_VALUE}. Thus the operation that
+     * generated invalid counts can be reversed by using the complement of the
+     * original operation with the same Bloom filter. This will restore the counts
+     * to the state prior to the invalid operation. Counts can then be extracted
+     * using {@link #forEachCount(BitCountConsumer)}.
+     */
+    @Override
+    public boolean isValid() {
+        return state >= 0;
+    }
+
+    @Override
+    public void forEachCount(BitCountConsumer action) {
+        for (int i = 0; i < counts.length; i++) {
+            if (counts[i] != 0) {
+                action.accept(i, counts[i]);
+            }
+        }
+    }
+
+    /**
+     * Apply the action for each index in the Bloom filter.
+     */
+    private void applyAsBloomFilter(final BloomFilter other, final IntConsumer action) {
+        verifyShape(other);
+        if (other instanceof ArrayCountingBloomFilter) {
+            // Only use the presence of non-zero and not the counts
+            final int[] counts2 = ((ArrayCountingBloomFilter) other).counts;
+            for (int i = 0; i < counts2.length; i++) {
+                if (counts2[i] != 0) {
+                    action.accept(i);
+                }
+            }
+        } else {
+            BitSet.valueOf(other.getBits()).stream().forEach(action);
+        }
+    }
+
+    /**
+     * Apply the action for each index in the hasher.
+     */
+    private void applyAsHasher(final Hasher hasher, final IntConsumer action) {
+        verifyHasher(hasher);
+        toSet(hasher).forEach(i -> action.accept(i));
 
 Review comment:
   Rather than `toSet()`, how about a filtering consumer? Something like:
   
   ```java
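   private static class UniqueFilter implements IntConsumer {
       private final IntConsumer child;
       // One bit per index; a set bit marks an index that has already been seen.
       private final long[] filter;

       UniqueFilter(IntConsumer child, Shape shape) {
           this.child = child;
           // The long index of the last bit, plus one, is the number of longs required
           filter = new long[BloomFilterIndexer.getLongIndex(shape.getNumberOfBits() - 1) + 1];
       }

       @Override
       public void accept(int index) {
           final int longIndex = BloomFilterIndexer.getLongIndex(index);
           final long mask = BloomFilterIndexer.getLongBit(index);
           if ((filter[longIndex] & mask) == 0) {
               // Record the index in the array (not a local copy) so repeats are skipped
               filter[longIndex] |= mask;
               child.accept(index);
           }
       }
   }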
   ```
   then use:
   
   ```java
           // Inside applyAsHasher the shape comes from the enclosing filter
           UniqueFilter filter = new UniqueFilter(action, getShape());
           hasher.getBits(getShape()).forEachRemaining(filter);
   ```
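   
   For reference, the same idea as a minimal standalone sketch that can be run outside the PR. It assumes plain shift arithmetic in place of `BloomFilterIndexer` and takes the number of bits directly rather than a `Shape`; the class name `DedupIntConsumer` and its `demo()` method are hypothetical and exist only for this illustration.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.IntConsumer;
   import java.util.stream.IntStream;

   /** Forwards each index to a child consumer only the first time it is seen. */
   final class DedupIntConsumer implements IntConsumer {
       private final IntConsumer child;
       /** One bit per index; a set bit marks an index already forwarded. */
       private final long[] seen;

       DedupIntConsumer(IntConsumer child, int numberOfBits) {
           this.child = child;
           // Round up to a whole number of 64-bit words.
           this.seen = new long[(numberOfBits + 63) >>> 6];
       }

       @Override
       public void accept(int index) {
           final int word = index >>> 6;
           // A long shift uses only the low 6 bits of the shift distance.
           final long mask = 1L << index;
           if ((seen[word] & mask) == 0) {
               seen[word] |= mask;
               child.accept(index);
           }
       }

       /** Feeds a sequence with repeats through the filter and collects the output. */
       static List<Integer> demo() {
           final List<Integer> out = new ArrayList<>();
           final DedupIntConsumer dedup = new DedupIntConsumer(out::add, 128);
           IntStream.of(3, 3, 64, 3, 64, 127).forEach(dedup);
           return out;
       }
   }
   ```
   
   Here `DedupIntConsumer.demo()` collects 3, 64 and 127 in first-seen order. The memory cost is one bit per index in the shape, with no boxing, whereas a `Set<Integer>` boxes every distinct index.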
   
