You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2020/03/07 23:21:17 UTC

[GitHub] [commons-collections] aherbert commented on a change in pull request #137: WIP: CountingBloomFilter

aherbert commented on a change in pull request #137: WIP: CountingBloomFilter
URL: https://github.com/apache/commons-collections/pull/137#discussion_r389319599
 
 

 ##########
 File path: src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
 ##########
 @@ -0,0 +1,396 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.NoSuchElementException;
+import java.util.PrimitiveIterator;
+import java.util.PrimitiveIterator.OfInt;
+import java.util.function.Consumer;
+import java.util.function.IntConsumer;
+import java.util.Set;
+
+import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
+import org.apache.commons.collections4.bloomfilter.hasher.Shape;
+import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;
+
+/**
+ * A counting Bloom filter using an array to track counts for each enabled bit
+ * index.
+ *
+ * <p>Any operation that results in negative counts or integer overflow of counts will
+ * mark this filter as invalid. This transition is not reversible. The counts for the
+ * filter immediately prior to the operation that create invalid counts can be recovered.
+ * See the documentation in {@link #isValid()} for details.
+ *
+ * <p>All the operations in the filter assume the counts are currently valid. Behaviour
+ * of an invalid filter is undefined. It will no longer function identically to a standard
+ * Bloom filter that is the merge of all the Bloom filters that have been added
+ * to and not later subtracted from the counting Bloom filter.
+ *
+ * <p>The maximum supported number of items that can be stored in the filter is
+ * limited by the maximum array size combined with the {@link Shape}. For
+ * example an implementation using a {@link Shape} with a false-positive
+ * probability of 1e-6 and {@link Integer#MAX_VALUE} bits can reversibly store
+ * approximately 75 million items using 20 hash functions per item with a memory
+ * consumption of approximately 8 GB.
+ *
+ * @since 4.5
+ * @see Shape
+ */
+public class ArrayCountingBloomFilter extends AbstractBloomFilter implements CountingBloomFilter {
+
+    /**
+     * The count of each bit index in the filter.
+     */
+    private final int[] counts;
+
+    /**
+     * The state flag. This is a bitwise OR of the entire history of all updated
+     * counts. If negative then a negative count or integer overflow has occurred on
+     * one or more counts in the history of the filter and the state is invalid.
+     *
+     * <p>Maintenance of this state flag is branch-free for improved performance. It
+     * eliminates a conditional check for a negative count during remove/subtract
+     * operations and a conditional check for integer overflow during merge/add
+     * operations.
+     *
+     * <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
+     * that overflows indicates that the number of items in the filter exceeds the
+     * maximum possible size (number of bits) of any Bloom filter constrained by
+     * integer indices. At this point the filter is most likely full (all bits are
+     * non-zero) and thus useless.
+     *
+     * <p>Negative counts are a concern if the filter is used incorrectly by
+     * removing an item that was never added. It is expected that a user of a
+     * counting Bloom filter will not perform this action as it is a mistake.
+     * Enabling an explicit recovery path for negative or overflow counts is a major
+     * performance burden not deemed necessary for the unlikely scenarios when an
+     * invalid state is created. Maintenance of the state flag is a concession to
+     * flag improper use that should not have a major performance impact.
+     */
+    private int state;
+
+    /**
+     * An iterator of all indexes with non-zero counts.
+     *
+     * <p>In the event that the filter state is invalid any index with a negative count
+     * will also be produced by the iterator.
+     */
+    private class IndexIterator implements PrimitiveIterator.OfInt {
+        /** The next non-zero index (or counts.length). */
+        private int next;
+
+        /**
+         * Create an instance.
+         */
+        IndexIterator() {
+            advance();
+        }
+
+        /**
+         * Advance to the next non-zero index.
+         */
+        void advance() {
+            while (next < counts.length && counts[next] == 0) {
+                next++;
+            }
+        }
+
+        @Override
+        public boolean hasNext() {
+            return next < counts.length;
+        }
+
+        @Override
+        public int nextInt() {
+            if (hasNext()) {
+                final int result = next++;
+                advance();
+                return result;
+            }
+            // Currently unreachable as the iterator is only used by
+            // the StaticHasher which iterates correctly.
+            throw new NoSuchElementException();
+        }
+    }
+
+    /**
+     * Constructs an empty counting Bloom filter with the specified shape.
+     *
+     * @param shape the shape of the filter
+     */
+    public ArrayCountingBloomFilter(final Shape shape) {
+        super(shape);
+        counts = new int[shape.getNumberOfBits()];
+    }
+
+    /**
+     * Constructs a counting Bloom filter from a hasher and a shape.
+     *
+     * <p>The filter will be equal to the result of merging the hasher with an empty
+     * filter; specifically duplicate indexes in the hasher are ignored.
+     *
+     * @param hasher the hasher to build the filter from
+     * @param shape the shape of the filter
+     * @throws IllegalArgumentException if the hasher cannot generate indices for
+     * the shape
+     * @see #merge(Hasher)
+     */
+    public ArrayCountingBloomFilter(final Hasher hasher, final Shape shape) {
+        super(shape);
+        // Given the filter is empty we can optimise the operation of merge(hasher)
+        verifyHasher(hasher);
+        // Delay array allocation until after hasher is verified
+        counts = new int[shape.getNumberOfBits()];
+        // All counts are zero. Ignore duplicates by initialising to 1
+        hasher.getBits(shape).forEachRemaining((IntConsumer) idx -> counts[idx] = 1);
+    }
+
+    @Override
+    public int cardinality() {
+        int size = 0;
+        for (final int c : counts) {
+            if (c != 0) {
+                size++;
+            }
+        }
+        return size;
+    }
+
+    @Override
+    public boolean contains(BloomFilter other) {
+        // The AbstractBloomFilter implementation converts both filters to long[] bits.
+        // This would involve checking all indexes in this filter against zero.
+        // Ideally we use an iterator of bit indexes to allow fail-fast on the
+        // first bit index that is zero.
+        if (other instanceof ArrayCountingBloomFilter) {
+            verifyShape(other);
+            return contains(((ArrayCountingBloomFilter) other).iterator());
+        }
+
+        // Note:
+        // This currently creates a StaticHasher which stores all the indexes.
+        // It would greatly benefit from direct generation of the index iterator
+        // avoiding the intermediate storage.
+        return contains(other.getHasher());
 
 Review comment:
   This comment is a prelude to the discussion from the mailing list where the getHasher() method is dropped from the BloomFilter in favour of an iterator of the indices. 
   
   Your example requires a function to convert a BloomFilter to an iterator. Who provides that method? Given the data is within the BloomFilter encapsulation requires it to be the BloomFilter to provide the method. So let's just go with the plan to have a BloomFilter provide an iterator. This would replace the getHasher method and its requirement to use a concrete StaticHasher class as the return type and the inefficiency of doing so for a close to capacity filter (the StaticHasher constructor currently passes an iterator through a Set to make indexes unique). 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services