Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2022/09/08 17:06:50 UTC
[GitHub] [commons-collections] aherbert commented on a diff in pull request #331: Collections 763: Remove BloomFilter constructors that create initial entry

aherbert commented on code in PR #331:
URL: https://github.com/apache/commons-collections/pull/331#discussion_r966184418


##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import java.util.function.IntPredicate;
+
+import org.junit.jupiter.api.Test;
+
+public class DefaultIndexProducerTest extends AbstractIndexProducerTest {
+
+    @Override
+    protected IndexProducer createProducer() {
+        return IndexProducer.fromIndexArray(1, 2);
+    }
+
+    @Override
+    protected IndexProducer createEmptyProducer() {
+        return new IndexProducer() {
+
+            @Override
+            public boolean forEachIndex(IntPredicate predicate)
+            {
+                return true;
+            }
+        };
+    }
+    
+    @Test
+    public void testFromBitMapProducer() {
+        IndexProducer ip = IndexProducer.fromBitMapProducer(BitMapProducer.fromBitMapArray( 0x07ffL));
+        int[] ary = ip.asIndexArray();
+        assertEquals(11, ary.length);
+        for (int i=0;i<11;i++)
+        {
+            assertEquals(i, ary[i]);
+        }
+    }
+    
+    @Test
+    public void testFromIndexArray() {
+        IndexProducer ip = IndexProducer.fromIndexArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

Review Comment:
   Again this is a linear sequence.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultBitMapProducerTest.java:
##########
@@ -52,4 +56,30 @@ public boolean forEachBitMap(LongPredicate predicate) {
             return true;
         }
     }
+    
+    @Test
+    public void testDefaultExpansion() {
+        BitMapProducer bmp = BitMapProducer.fromBitMapArray(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L);

Review Comment:
   Testing with a linear sequence is too simple. I would prefer some random indices.
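One way to act on this is to draw the indices from a seeded generator, so the input is non-sequential but the test stays reproducible. The sketch below uses only the JDK; `RandomIndexRoundTrip` and its helpers are hypothetical stand-ins for the `BitMapProducer`/`IndexProducer` conversions, not the commons-collections API.

```java
import java.util.Arrays;
import java.util.Random;

/** JDK-only sketch; every helper here is a hypothetical stand-in, not library code. */
final class RandomIndexRoundTrip {

    /** Picks {@code count} distinct indices in [0, numberOfBits) from a seeded generator, sorted ascending. */
    static int[] randomIndices(long seed, int count, int numberOfBits) {
        return new Random(seed).ints(0, numberOfBits).distinct().limit(count).sorted().toArray();
    }

    /** Packs bit indices into 64-bit words, one bit per index. */
    static long[] toBitMaps(int[] indices, int numberOfBits) {
        long[] bits = new long[(numberOfBits + 63) / 64];
        for (int index : indices) {
            bits[index >> 6] |= 1L << (index & 63);
        }
        return bits;
    }

    /** Reads the enabled bit indices back out in ascending order. */
    static int[] toIndices(long[] bits) {
        int[] result = new int[Arrays.stream(bits).mapToInt(Long::bitCount).sum()];
        int n = 0;
        for (int word = 0; word < bits.length; word++) {
            // Clear the lowest set bit on each pass until the word is empty
            for (long w = bits[word]; w != 0; w &= w - 1) {
                result[n++] = word * 64 + Long.numberOfTrailingZeros(w);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] expected = randomIndices(42L, 10, 128);
        int[] actual = toIndices(toBitMaps(expected, 128));
        System.out.println(Arrays.equals(expected, actual));
    }
}
```

A fixed seed keeps failures repeatable; a real test would presumably swap the helpers for the producers under test.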



##########
src/main/java/org/apache/commons/collections4/bloomfilter/BloomFilter.java:
##########
@@ -135,28 +135,57 @@ default boolean contains(BitMapProducer bitMapProducer) {
      * @param other The bloom filter to merge into this one.
      * @return true if the merge was successful
      */
-    boolean merge(BloomFilter other);
+    default boolean merge(BloomFilter other) {
+        return (characteristics() & SPARSE) != 0 ? merge( (IndexProducer) other ) : merge( (BitMapProducer) other);

Review Comment:
   Remove the single whitespace after `merge(`.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -77,6 +79,89 @@ public interface CountingBloomFilter extends BloomFilter, BitCountProducer {
 
     // Modification Operations
 
+    /**
+     * Merges the specified Bloom filter into this Bloom filter.
+     *
+     * <p>Specifically: all counts for the indexes identified by the {@code other} filter will be incremented by 1,</p>
+     *
+     * <p>Note: If the other filter is a counting Bloom filter the index counts are ignored and it is treated as an
+     * IndexProducer.</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * @param other the other Bloom filter
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #add(BitCountProducer)
+     */
+    default boolean merge(final BloomFilter other) {
+        Objects.requireNonNull(other, "other");
+        return merge((IndexProducer) other);
+    }
+
+    /**
+     * Merges the specified Hasher into this Bloom filter.
+     *
+     * <p>Specifically: all counts for the unique indexes identified by the {@code hasher} will be incremented by 1,</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * @param hasher the hasher
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #add(BitCountProducer)
+     */
+    default boolean merge(final Hasher hasher) {
+        Objects.requireNonNull(hasher, "hasher");
+        try {
+            return add(BitCountProducer.from(hasher.uniqueIndices(getShape())));
+        } catch (IndexOutOfBoundsException e) {
+            throw new IllegalArgumentException(
+                    String.format("Filter only accepts values in the [0,%d) range", getShape().getNumberOfBits()), e);
+        }
+    }
+
+    /**
+     * Merges the specified index producer into this Bloom filter.
+     *
+     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1,</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * <p>Note: Indices that are returned multiple times will be incremented multiple times.</p>
+     *
+     * @param indexProducer the IndexProducer
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #add(BitCountProducer)
+     */
+    default boolean merge(final IndexProducer indexProducer) {
+        Objects.requireNonNull(indexProducer, "producer");

Review Comment:
   `"indexProducer"`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -20,58 +20,59 @@
  *
  * <h2>Background:</h2>
  *
- * <p>The Bloom filter is a probabilistic data structure that indicates where things are not.
- * Conceptually it is a bit vector. You create a Bloom filter by creating hashes
- * and converting those to enabled bits in the vector. Multiple Bloom filters may be merged
- * together into one Bloom filter.  It is possible to test if a filter {@code B} has merged into
- * another filter {@code A} by verifying that {@code (A & B) == B}.</p>
- *
- * <p>Bloom filters are generally used where hash
- * tables would be too large, or as a filter front end for longer processes. For example
- * most browsers have a Bloom filter that is built from all known bad URLs (ones that
- * serve up malware). When you enter a URL the browser builds a Bloom filter and checks to
- * see if it is "in" the bad URL filter. If not the URL is good, if it matches, then the
- * expensive lookup on a remote system is made to see if it actually is in the list. There
- * are lots of other uses, and in most cases the reason is to perform a fast check as a
- * gateway for a longer operation. </p>
+ * <p>The Bloom filter is a probabilistic data structure that indicates where things are not. Conceptually it is a bit
+ * vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple
+ * Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter {@code B} has merged
+ * into another filter {@code A} by verifying that {@code (A & B) == B}.</p>
+ *
+ * <p>Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes.
+ * For example most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware).
+ * When you enter a URL the browser builds a Bloom filter and checks to see if it is "in" the bad URL filter. If not the
+ * URL is good, if it matches, then the expensive lookup on a remote system is made to see if it actually is in the
+ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer
+ * operation.</p>
  *
  * <h3>BloomFilter</h3>
  *
- * <p>The Bloom filter architecture here is designed so that the implementation of the storage of bits is abstracted.
+ * <p>The Bloom filter architecture here is designed for speed of execution, so some methods like {@code merge}, {@code remove},
+ * {@code add}, and {@code subtract} may throw exceptions.  One an exception is thrown the state of the Bloom filter is unknown.
+ * The choice to use not use atomic transactions was made to achive maximum performance under correct usage.</p>
+ *
+ * <p>In addition the architecture is designed so that the implementation of the storage of bits is abstracted.
  * Programs that utilize the Bloom filters may use the {@code BitMapProducer} or {@code IndexProducer} to retrieve a
- * representation of the internal structure.  Additional methods are available in the {@code BitMap} to assist in
+ * representation of the internal structure. Additional methods are available in the {@code BitMap} to assist in
  * manipulation of the representations.</p>
  *
- * <p>The bloom filter code is an interface that requires implementation of 6 methods:</p>
+ * <p>The bloom filter code is an interface that requires implementation of 9 methods:</p>
  * <ul>
- * <li>{@code cardinality()}
- * returns the number of bits enabled in the Bloom filter.</li>
+ * <li>{@code cardinality()} returns the number of bits enabled in the Bloom filter.</li>
  *
- * <li>{@code contains(BitMapProducer)} which
- * returns true if the bits specified by the bit maps generated by the BitMapProducer are enabled in the Bloom filter.</li>
+ * <li>{@code characteristics()} which Returns a bitmap int of characteristics values.</li>
  *
- *  <li>{@code contains(IndexProducer)} which
- * returns true if the bits specified by the indices generated by IndexProducer are enabled in the Bloom filter.</li>
+ * <li>{@code clear()} which resets the Bloomfilter to its initial empty state.</li>
  *
- * <li>{@code getShape()} which
- * returns the shape the Bloom filter was created with.</li>
-
- * <li>{@code isSparse()} which
- * returns true if an the implementation tracks indices natively, false if bit maps are used.  In cases where
- * neither are used the {@code isSparse} return value should reflect which is faster to produce.</li>
+ * <li>{@code contains(IndexProducer)} which returns true if the bits specified by the indices generated by
+ * IndexProducer are enabled in the Bloom filter.</li>
+ *
+ * <li>{@code copy()} which returns a fresh copy of the bitmap.</li>
+ *
+ * <li>{@code getShape()} which returns the shape the Bloom filter was created with.</li>
+ *
+ * <li>{@code characteristics()} which an integer of characteristics flags.</li>
+ *
+ * <li>{@code merge(BitMapProducer)} which Merges the BitMaps from the BitMapProducer into the internal
+ * representation of the Bloom filter.</li>
  *
- * <li>{@code mergeInPlace(BloomFilter)} which
- * utilizes either the {@code BitMapProducer} or {@code IndexProducer} from the argument to enable extra bits
- * in the internal representation of the Bloom filter.</li>
+ * <li>{@code merge(IndexProducer)} which Merges the indices from the IndexProducer into the internal

Review Comment:
   `merges`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultBloomFilterTest.java:
##########
@@ -163,28 +123,30 @@ public boolean merge(BloomFilter other) {
                             String.format("Value in list %s is less than 0", indices.first()));
                 }
             }
-            return true;
         }
 
         @Override
-        public int cardinality() {
-            return indices.size();
+        public boolean merge(IndexProducer indexProducer) {
+            boolean result = indexProducer.forEachIndex((x) -> {

Review Comment:
   You do not need parentheses around the `(x)` for single argument lambda functions.
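For reference, a small JDK-only sketch of the forms in question. `LambdaParens` and its field names are made up for illustration; the predicate body is arbitrary.

```java
import java.util.function.IntPredicate;

final class LambdaParens {
    // A single untyped parameter may be written with or without parentheses;
    // both predicates behave identically.
    static final IntPredicate WITH_PARENS = (x) -> x >= 0;
    static final IntPredicate WITHOUT_PARENS = x -> x >= 0;

    // Parentheses are required once the parameter declares its type.
    static final IntPredicate TYPED = (int x) -> x >= 0;

    public static void main(String[] args) {
        System.out.println(WITH_PARENS.test(3) == WITHOUT_PARENS.test(3));
    }
}
```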



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultBitMapProducerTest.java:
##########
@@ -56,26 +56,25 @@ public boolean forEachBitMap(LongPredicate predicate) {
             return true;
         }
     }
-    
+
     @Test
     public void testDefaultExpansion() {
         BitMapProducer bmp = BitMapProducer.fromBitMapArray(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L);
         long[] ary = bmp.asBitMapArray();
         assertEquals( 17,  ary.length);
-        for (int i=0;i<17;i++)
-        {
-            assertEquals((long)i, ary[i]);
+        for (int i=0; i<17; i++) {

Review Comment:
   Typically we use whitespace around this: `for (int i = 0; i < 17; i++)`.
   
   In this case I think the test should be modified to create an expected array and use `assertArrayEquals`.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -20,58 +20,59 @@
  *
  * <h2>Background:</h2>
  *
- * <p>The Bloom filter is a probabilistic data structure that indicates where things are not.
- * Conceptually it is a bit vector. You create a Bloom filter by creating hashes
- * and converting those to enabled bits in the vector. Multiple Bloom filters may be merged
- * together into one Bloom filter.  It is possible to test if a filter {@code B} has merged into
- * another filter {@code A} by verifying that {@code (A & B) == B}.</p>
- *
- * <p>Bloom filters are generally used where hash
- * tables would be too large, or as a filter front end for longer processes. For example
- * most browsers have a Bloom filter that is built from all known bad URLs (ones that
- * serve up malware). When you enter a URL the browser builds a Bloom filter and checks to
- * see if it is "in" the bad URL filter. If not the URL is good, if it matches, then the
- * expensive lookup on a remote system is made to see if it actually is in the list. There
- * are lots of other uses, and in most cases the reason is to perform a fast check as a
- * gateway for a longer operation. </p>
+ * <p>The Bloom filter is a probabilistic data structure that indicates where things are not. Conceptually it is a bit
+ * vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple
+ * Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter {@code B} has merged
+ * into another filter {@code A} by verifying that {@code (A & B) == B}.</p>
+ *
+ * <p>Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes.
+ * For example most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware).
+ * When you enter a URL the browser builds a Bloom filter and checks to see if it is "in" the bad URL filter. If not the
+ * URL is good, if it matches, then the expensive lookup on a remote system is made to see if it actually is in the
+ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer
+ * operation.</p>
  *
  * <h3>BloomFilter</h3>
  *
- * <p>The Bloom filter architecture here is designed so that the implementation of the storage of bits is abstracted.
+ * <p>The Bloom filter architecture here is designed for speed of execution, so some methods like {@code merge}, {@code remove},
+ * {@code add}, and {@code subtract} may throw exceptions.  One an exception is thrown the state of the Bloom filter is unknown.

Review Comment:
   `Once` an exception is thrown.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import java.util.function.IntPredicate;
+
+import org.junit.jupiter.api.Test;
+
+public class DefaultIndexProducerTest extends AbstractIndexProducerTest {
+
+    @Override
+    protected IndexProducer createProducer() {
+        return IndexProducer.fromIndexArray(1, 2);
+    }
+
+    @Override
+    protected IndexProducer createEmptyProducer() {
+        return new IndexProducer() {
+
+            @Override
+            public boolean forEachIndex(IntPredicate predicate)
+            {
+                return true;
+            }
+        };
+    }
+    
+    @Test
+    public void testFromBitMapProducer() {
+        IndexProducer ip = IndexProducer.fromBitMapProducer(BitMapProducer.fromBitMapArray( 0x07ffL));

Review Comment:
   These tests use a continuous range. A more generic test could use the BitMap class to create the input:
   ```Java
   // Arbitrary, non-sequential indices
   int[] expected = {13, 3, 18, 1, 31, 57, 7};
   // numberOfBitMaps takes a bit count, so size for the max index + 1
   long[] bits = new long[BitMap.numberOfBitMaps(Arrays.stream(expected).max().getAsInt() + 1)];
   for (int bitIndex : expected) {
       BitMap.set(bits, bitIndex);
   }
   IndexProducer ip = IndexProducer.fromBitMapProducer(BitMapProducer.fromBitMapArray(bits));
   int[] ary = ip.asIndexArray();
   Arrays.sort(expected);
   assertArrayEquals(expected, ary);
   ```



##########
src/main/java/org/apache/commons/collections4/bloomfilter/SparseBloomFilter.java:
##########
@@ -140,23 +78,27 @@ private boolean add(int idx) {
         return true;
     }
 
-    /**
-     * Performs a merge using an IndexProducer.
-     * @param indexProducer the IndexProducer to merge from.
-     * @throws IllegalArgumentException if producer sends illegal value.
-     */
-    private void merge(IndexProducer indexProducer) {
+    @Override
+    public boolean merge(IndexProducer indexProducer) {
+        Objects.requireNonNull(indexProducer, "indexProducer");
         indexProducer.forEachIndex(this::add);
         if (!this.indices.isEmpty()) {
             if (this.indices.last() >= shape.getNumberOfBits()) {
                 throw new IllegalArgumentException(String.format("Value in list %s is greater than maximum value (%s)",
-                        this.indices.last(), shape.getNumberOfBits()));
+                        this.indices.last(), shape.getNumberOfBits()-1));

Review Comment:
   Whitespace formatting: `shape.getNumberOfBits() - 1`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -77,6 +79,89 @@ public interface CountingBloomFilter extends BloomFilter, BitCountProducer {
 
     // Modification Operations
 
+    /**
+     * Merges the specified Bloom filter into this Bloom filter.
+     *
+     * <p>Specifically: all counts for the indexes identified by the {@code other} filter will be incremented by 1,</p>
+     *
+     * <p>Note: If the other filter is a counting Bloom filter the index counts are ignored and it is treated as an
+     * IndexProducer.</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * @param other the other Bloom filter
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #add(BitCountProducer)
+     */
+    default boolean merge(final BloomFilter other) {
+        Objects.requireNonNull(other, "other");
+        return merge((IndexProducer) other);
+    }
+
+    /**
+     * Merges the specified Hasher into this Bloom filter.
+     *
+     * <p>Specifically: all counts for the unique indexes identified by the {@code hasher} will be incremented by 1,</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * @param hasher the hasher
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #add(BitCountProducer)
+     */
+    default boolean merge(final Hasher hasher) {
+        Objects.requireNonNull(hasher, "hasher");
+        try {

Review Comment:
   Why not:
   ```Java
   Objects.requireNonNull(hasher, "hasher");
   return merge(hasher.uniqueIndices(getShape()));
   ```
   
   This has parity with the default `remove` method.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -20,58 +20,59 @@
  *
  * <h2>Background:</h2>
  *
- * <p>The Bloom filter is a probabilistic data structure that indicates where things are not.
- * Conceptually it is a bit vector. You create a Bloom filter by creating hashes
- * and converting those to enabled bits in the vector. Multiple Bloom filters may be merged
- * together into one Bloom filter.  It is possible to test if a filter {@code B} has merged into
- * another filter {@code A} by verifying that {@code (A & B) == B}.</p>
- *
- * <p>Bloom filters are generally used where hash
- * tables would be too large, or as a filter front end for longer processes. For example
- * most browsers have a Bloom filter that is built from all known bad URLs (ones that
- * serve up malware). When you enter a URL the browser builds a Bloom filter and checks to
- * see if it is "in" the bad URL filter. If not the URL is good, if it matches, then the
- * expensive lookup on a remote system is made to see if it actually is in the list. There
- * are lots of other uses, and in most cases the reason is to perform a fast check as a
- * gateway for a longer operation. </p>
+ * <p>The Bloom filter is a probabilistic data structure that indicates where things are not. Conceptually it is a bit
+ * vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple
+ * Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter {@code B} has merged
+ * into another filter {@code A} by verifying that {@code (A & B) == B}.</p>
+ *
+ * <p>Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes.
+ * For example most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware).
+ * When you enter a URL the browser builds a Bloom filter and checks to see if it is "in" the bad URL filter. If not the
+ * URL is good, if it matches, then the expensive lookup on a remote system is made to see if it actually is in the
+ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer
+ * operation.</p>
  *
  * <h3>BloomFilter</h3>
  *
- * <p>The Bloom filter architecture here is designed so that the implementation of the storage of bits is abstracted.
+ * <p>The Bloom filter architecture here is designed for speed of execution, so some methods like {@code merge}, {@code remove},
+ * {@code add}, and {@code subtract} may throw exceptions.  One an exception is thrown the state of the Bloom filter is unknown.
+ * The choice to use not use atomic transactions was made to achive maximum performance under correct usage.</p>

Review Comment:
   `achieve`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -110,7 +198,52 @@ public interface CountingBloomFilter extends BloomFilter, BitCountProducer {
      * @see #isValid()
      * @see #subtract(BitCountProducer)
      */
-    boolean remove(Hasher hasher);
+    default boolean remove(final Hasher hasher) {
+        Objects.requireNonNull(hasher, "hasher");
+        return remove(hasher.uniqueIndices(getShape()));
+    }
+
+    /**
+     * Removes the values from the specified IndexProducer from the Bloom filter from this Bloom filter.
+     *
+     * <p>Specifically all counts for the unique indices produced by the {@code hasher} will be
+     * decremented by 1.</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * <p>Node: This method expects index producers that produce unique values.</p>
+     *
+     * @param indexProducer the IndexProducer to provide the indexes
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #subtract(BitCountProducer)
+     */
+    default boolean remove(final IndexProducer indexProducer) {
+        Objects.requireNonNull(indexProducer, "indexProducer");
+        try {
+            return subtract(BitCountProducer.from(indexProducer));
+        } catch (IndexOutOfBoundsException e) {
+            throw new IllegalArgumentException(
+                    String.format("Filter only accepts values in the [0,%d) range", getShape().getNumberOfBits()));
+        }
+    }
+
+    /**
+     * Removes the specified BitMapProducer from this Bloom filter.
+     *
+     * <p>Specifically all counts for the indices produced by the {@code bitMapProducer} will be
+     * decremented by 1.</p>
+     *
+     * <p>This method will return {@code true} if the filter is valid after the operation.</p>
+     *
+     * @param bitMapProducer the BitMapProducer to provide the indexes
+     * @return {@code true} if the removal was successful and the state is valid
+     * @see #isValid()
+     * @see #subtract(BitCountProducer)
+     */
+    default boolean remove(final BitMapProducer bitMapProducer) {
+        return remove(IndexProducer.fromBitMapProducer(bitMapProducer));

Review Comment:
   Missing `Objects.requireNonNull(bitMapProducer, "bitMapProducer");`
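The point of the explicit check is that a `null` argument fails immediately with the parameter name, rather than somewhere inside the delegate. A minimal JDK-only sketch of that behavior follows; `NullCheckSketch.countBits` is a hypothetical method, not part of `CountingBloomFilter`.

```java
import java.util.Objects;

final class NullCheckSketch {

    /** Hypothetical stand-in for a method that validates its argument before doing any work. */
    static int countBits(long[] bitMaps) {
        // Fail fast, naming the offending parameter in the exception message
        Objects.requireNonNull(bitMaps, "bitMaps");
        int count = 0;
        for (long word : bitMaps) {
            count += Long.bitCount(word);
        }
        return count;
    }

    public static void main(String[] args) {
        try {
            countBits(null);
        } catch (NullPointerException e) {
            // The message is the second argument passed to requireNonNull
            System.out.println(e.getMessage());
        }
    }
}
```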



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractBloomFilterTest.java:
##########
@@ -93,16 +94,30 @@ protected final Shape getTestShape() {
      *
      */
     @Test
-    public void testConstructWithBadHasher() {
+    public void testMergeWithBadHasher() {
         // value too large
+        final BloomFilter f = createEmptyFilter(getTestShape());
         assertThrows(IllegalArgumentException.class,
-                () -> createFilter(getTestShape(), new BadHasher(getTestShape().getNumberOfBits())));
+                () -> f.merge(new BadHasher(getTestShape().getNumberOfBits())));
         // negative value
-        assertThrows(IllegalArgumentException.class, () -> createFilter(getTestShape(), new BadHasher(-1)));
+        BloomFilter f2 = createEmptyFilter(getTestShape());
+        assertThrows(IllegalArgumentException.class, () -> f2.merge(new BadHasher(-1)));
     }
 
     @Test
-    public void testConstructWitBitMapProducer() {
+    public void testMergeWithHasher() {
+        // value too large
+        final BloomFilter f = createEmptyFilter(getTestShape());
+        f.merge(from1);
+        int[] idx = f.asIndexArray();
+        assertEquals(getTestShape().getNumberOfHashFunctions(), idx.length);
+        for (int i=0; i<idx.length; i++) {

Review Comment:
   Not seeing the fix in the `for` line.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -20,58 +20,59 @@
  *
  * <h2>Background:</h2>
  *
- * <p>The Bloom filter is a probabilistic data structure that indicates where things are not.
- * Conceptually it is a bit vector. You create a Bloom filter by creating hashes
- * and converting those to enabled bits in the vector. Multiple Bloom filters may be merged
- * together into one Bloom filter.  It is possible to test if a filter {@code B} has merged into
- * another filter {@code A} by verifying that {@code (A & B) == B}.</p>
- *
- * <p>Bloom filters are generally used where hash
- * tables would be too large, or as a filter front end for longer processes. For example
- * most browsers have a Bloom filter that is built from all known bad URLs (ones that
- * serve up malware). When you enter a URL the browser builds a Bloom filter and checks to
- * see if it is "in" the bad URL filter. If not the URL is good, if it matches, then the
- * expensive lookup on a remote system is made to see if it actually is in the list. There
- * are lots of other uses, and in most cases the reason is to perform a fast check as a
- * gateway for a longer operation. </p>
+ * <p>The Bloom filter is a probabilistic data structure that indicates where things are not. Conceptually it is a bit
+ * vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple
+ * Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter {@code B} has merged
+ * into another filter {@code A} by verifying that {@code (A & B) == B}.</p>
+ *
+ * <p>Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes.
+ * For example most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware).
+ * When you enter a URL the browser builds a Bloom filter and checks to see if it is "in" the bad URL filter. If not the
+ * URL is good, if it matches, then the expensive lookup on a remote system is made to see if it actually is in the
+ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer
+ * operation.</p>
  *
  * <h3>BloomFilter</h3>
  *
- * <p>The Bloom filter architecture here is designed so that the implementation of the storage of bits is abstracted.
+ * <p>The Bloom filter architecture here is designed for speed of execution, so some methods like {@code merge}, {@code remove},
+ * {@code add}, and {@code subtract} may throw exceptions.  One an exception is thrown the state of the Bloom filter is unknown.
+ * The choice to use not use atomic transactions was made to achive maximum performance under correct usage.</p>
+ *
+ * <p>In addition the architecture is designed so that the implementation of the storage of bits is abstracted.
  * Programs that utilize the Bloom filters may use the {@code BitMapProducer} or {@code IndexProducer} to retrieve a
- * representation of the internal structure.  Additional methods are available in the {@code BitMap} to assist in
+ * representation of the internal structure. Additional methods are available in the {@code BitMap} to assist in
  * manipulation of the representations.</p>
  *
- * <p>The bloom filter code is an interface that requires implementation of 6 methods:</p>
+ * <p>The bloom filter code is an interface that requires implementation of 9 methods:</p>
  * <ul>
- * <li>{@code cardinality()}
- * returns the number of bits enabled in the Bloom filter.</li>
+ * <li>{@code cardinality()} returns the number of bits enabled in the Bloom filter.</li>
  *
- * <li>{@code contains(BitMapProducer)} which
- * returns true if the bits specified by the bit maps generated by the BitMapProducer are enabled in the Bloom filter.</li>
+ * <li>{@code characteristics()} which Returns a bitmap int of characteristics values.</li>

Review Comment:
   `returns` (lowercase)



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -20,58 +20,59 @@
  *
  * <h2>Background:</h2>
  *
- * <p>The Bloom filter is a probabilistic data structure that indicates where things are not.
- * Conceptually it is a bit vector. You create a Bloom filter by creating hashes
- * and converting those to enabled bits in the vector. Multiple Bloom filters may be merged
- * together into one Bloom filter.  It is possible to test if a filter {@code B} has merged into
- * another filter {@code A} by verifying that {@code (A & B) == B}.</p>
- *
- * <p>Bloom filters are generally used where hash
- * tables would be too large, or as a filter front end for longer processes. For example
- * most browsers have a Bloom filter that is built from all known bad URLs (ones that
- * serve up malware). When you enter a URL the browser builds a Bloom filter and checks to
- * see if it is "in" the bad URL filter. If not the URL is good, if it matches, then the
- * expensive lookup on a remote system is made to see if it actually is in the list. There
- * are lots of other uses, and in most cases the reason is to perform a fast check as a
- * gateway for a longer operation. </p>
+ * <p>The Bloom filter is a probabilistic data structure that indicates where things are not. Conceptually it is a bit
+ * vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple
+ * Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter {@code B} has merged
+ * into another filter {@code A} by verifying that {@code (A & B) == B}.</p>
+ *
+ * <p>Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes.
+ * For example most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware).
+ * When you enter a URL the browser builds a Bloom filter and checks to see if it is "in" the bad URL filter. If not the
+ * URL is good, if it matches, then the expensive lookup on a remote system is made to see if it actually is in the
+ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer
+ * operation.</p>
  *
  * <h3>BloomFilter</h3>
  *
- * <p>The Bloom filter architecture here is designed so that the implementation of the storage of bits is abstracted.
+ * <p>The Bloom filter architecture here is designed for speed of execution, so some methods like {@code merge}, {@code remove},
+ * {@code add}, and {@code subtract} may throw exceptions.  One an exception is thrown the state of the Bloom filter is unknown.
+ * The choice to use not use atomic transactions was made to achive maximum performance under correct usage.</p>
+ *
+ * <p>In addition the architecture is designed so that the implementation of the storage of bits is abstracted.
  * Programs that utilize the Bloom filters may use the {@code BitMapProducer} or {@code IndexProducer} to retrieve a
- * representation of the internal structure.  Additional methods are available in the {@code BitMap} to assist in
+ * representation of the internal structure. Additional methods are available in the {@code BitMap} to assist in
  * manipulation of the representations.</p>
  *
- * <p>The bloom filter code is an interface that requires implementation of 6 methods:</p>
+ * <p>The bloom filter code is an interface that requires implementation of 9 methods:</p>

Review Comment:
   If you drop the explicit 9 then you will not be caught out in the future when methods are added or removed, e.g. "requires implementation of the following methods:"
   
   Would it help to insert each method with a `@link` to the method rather than `@code`? Javadoc parsing will then flag any incorrect method names.
   
   ```
   // Will render as 'cardinality()'
   {@link BloomFilter#cardinality()}
   // Will render as 'cardinality'
   {@link BloomFilter#cardinality() cardinality}
   ```
   



##########
src/main/java/org/apache/commons/collections4/bloomfilter/BloomFilter.java:
##########
@@ -135,28 +135,57 @@ default boolean contains(BitMapProducer bitMapProducer) {
      * @param other The bloom filter to merge into this one.
      * @return true if the merge was successful
      */
-    boolean merge(BloomFilter other);
+    default boolean merge(BloomFilter other) {
+        return (characteristics() & SPARSE) != 0 ? merge( (IndexProducer) other ) : merge( (BitMapProducer) other);
+    }
 
     /**
      * Merges the specified hasher into this Bloom filter. Specifically all
      * bit indexes that are identified by the {@code hasher} will be enabled in this filter.
      *
      * <p><em>Note: This method should return {@code true} even if no additional bit indexes were
      * enabled. A {@code false} result indicates that this filter may or may not contain
-     * the {@code other} Bloom filter.</em>  This state may occur in complex Bloom filter implementations like
+     * the {@code hasher} values.</em>  This state may occur in complex Bloom filter implementations like
      * counting Bloom filters.</p>
      *
      * @param hasher The hasher to merge.
      * @return true if the merge was successful
      */
     default boolean merge(Hasher hasher) {
         Objects.requireNonNull(hasher, "hasher");
-        Shape shape = getShape();
-        // create the Bloom filter that is most likely to merge quickly with this one
-        BloomFilter result = (characteristics() & SPARSE) != 0 ? new SparseBloomFilter(shape, hasher) : new SimpleBloomFilter(shape, hasher);
-        return merge(result);
+        return merge( hasher.indices(getShape()));

Review Comment:
   Remove the single whitespace after `merge(`.
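
For context, the new default `merge(BloomFilter)` in this hunk chooses between the index-based and bit-map-based merge by testing a flag in `characteristics()`. Below is a minimal stand-alone sketch of that bit-flag test. The class name, the `mergePath` helper and the `SPARSE` value of `0x1` are assumptions for illustration only; the value of the real constant is not shown in this diff.

```java
public class CharacteristicsSketch {
    // Hypothetical flag value for illustration; not taken from the library.
    public static final int SPARSE = 0x1;

    // Reports which merge overload a filter with these characteristics would use.
    public static String mergePath(int characteristics) {
        return (characteristics & SPARSE) != 0 ? "IndexProducer" : "BitMapProducer";
    }

    public static void main(String[] args) {
        System.out.println(mergePath(SPARSE)); // flag set: index-based merge
        System.out.println(mergePath(0));      // flag clear: bit-map-based merge
    }
}
```

Testing with `& SPARSE` rather than `== SPARSE` keeps the check valid if further flags are later packed into the same int.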



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -20,58 +20,59 @@
  *
  * <h2>Background:</h2>
  *
- * <p>The Bloom filter is a probabilistic data structure that indicates where things are not.
- * Conceptually it is a bit vector. You create a Bloom filter by creating hashes
- * and converting those to enabled bits in the vector. Multiple Bloom filters may be merged
- * together into one Bloom filter.  It is possible to test if a filter {@code B} has merged into
- * another filter {@code A} by verifying that {@code (A & B) == B}.</p>
- *
- * <p>Bloom filters are generally used where hash
- * tables would be too large, or as a filter front end for longer processes. For example
- * most browsers have a Bloom filter that is built from all known bad URLs (ones that
- * serve up malware). When you enter a URL the browser builds a Bloom filter and checks to
- * see if it is "in" the bad URL filter. If not the URL is good, if it matches, then the
- * expensive lookup on a remote system is made to see if it actually is in the list. There
- * are lots of other uses, and in most cases the reason is to perform a fast check as a
- * gateway for a longer operation. </p>
+ * <p>The Bloom filter is a probabilistic data structure that indicates where things are not. Conceptually it is a bit
+ * vector. You create a Bloom filter by creating hashes and converting those to enabled bits in the vector. Multiple
+ * Bloom filters may be merged together into one Bloom filter. It is possible to test if a filter {@code B} has merged
+ * into another filter {@code A} by verifying that {@code (A & B) == B}.</p>
+ *
+ * <p>Bloom filters are generally used where hash tables would be too large, or as a filter front end for longer processes.
+ * For example most browsers have a Bloom filter that is built from all known bad URLs (ones that serve up malware).
+ * When you enter a URL the browser builds a Bloom filter and checks to see if it is "in" the bad URL filter. If not the
+ * URL is good, if it matches, then the expensive lookup on a remote system is made to see if it actually is in the
+ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer
+ * operation.</p>
  *
  * <h3>BloomFilter</h3>
  *
- * <p>The Bloom filter architecture here is designed so that the implementation of the storage of bits is abstracted.
+ * <p>The Bloom filter architecture here is designed for speed of execution, so some methods like {@code merge}, {@code remove},
+ * {@code add}, and {@code subtract} may throw exceptions.  One an exception is thrown the state of the Bloom filter is unknown.
+ * The choice to use not use atomic transactions was made to achive maximum performance under correct usage.</p>
+ *
+ * <p>In addition the architecture is designed so that the implementation of the storage of bits is abstracted.
  * Programs that utilize the Bloom filters may use the {@code BitMapProducer} or {@code IndexProducer} to retrieve a
- * representation of the internal structure.  Additional methods are available in the {@code BitMap} to assist in
+ * representation of the internal structure. Additional methods are available in the {@code BitMap} to assist in
  * manipulation of the representations.</p>
  *
- * <p>The bloom filter code is an interface that requires implementation of 6 methods:</p>
+ * <p>The bloom filter code is an interface that requires implementation of 9 methods:</p>
  * <ul>
- * <li>{@code cardinality()}
- * returns the number of bits enabled in the Bloom filter.</li>
+ * <li>{@code cardinality()} returns the number of bits enabled in the Bloom filter.</li>
  *
- * <li>{@code contains(BitMapProducer)} which
- * returns true if the bits specified by the bit maps generated by the BitMapProducer are enabled in the Bloom filter.</li>
+ * <li>{@code characteristics()} which Returns a bitmap int of characteristics values.</li>
  *
- *  <li>{@code contains(IndexProducer)} which
- * returns true if the bits specified by the indices generated by IndexProducer are enabled in the Bloom filter.</li>
+ * <li>{@code clear()} which resets the Bloomfilter to its initial empty state.</li>
  *
- * <li>{@code getShape()} which
- * returns the shape the Bloom filter was created with.</li>
-
- * <li>{@code isSparse()} which
- * returns true if an the implementation tracks indices natively, false if bit maps are used.  In cases where
- * neither are used the {@code isSparse} return value should reflect which is faster to produce.</li>
+ * <li>{@code contains(IndexProducer)} which returns true if the bits specified by the indices generated by
+ * IndexProducer are enabled in the Bloom filter.</li>
+ *
+ * <li>{@code copy()} which returns a fresh copy of the bitmap.</li>
+ *
+ * <li>{@code getShape()} which returns the shape the Bloom filter was created with.</li>
+ *
+ * <li>{@code characteristics()} which an integer of characteristics flags.</li>
+ *
+ * <li>{@code merge(BitMapProducer)} which Merges the BitMaps from the BitMapProducer into the internal

Review Comment:
   `merges` (lowercase)



##########
src/main/java/org/apache/commons/collections4/bloomfilter/SimpleBloomFilter.java:
##########
@@ -133,12 +81,9 @@ public SimpleBloomFilter copy() {
         return new SimpleBloomFilter(this);
     }
 
-    /**
-     * Performs a merge using an IndexProducer.
-     * @param indexProducer the IndexProducer to merge from.
-     * @throws IllegalArgumentException if producer sends illegal value.
-     */
-    private void merge(IndexProducer indexProducer) {
+    @Override
+    public boolean merge(IndexProducer indexProducer) {
+        Objects.requireNonNull(indexProducer, "indexProducer");
         indexProducer.forEachIndex(idx -> {
             if (idx < 0 || idx >= shape.getNumberOfBits()) {
                 throw new IllegalArgumentException(String.format(

Review Comment:
   Not yet fixed. It should be:
   `"IndexProducer should only send values in the range[0,%s)", shape.getNumberOfBits()));`
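
To make the intended behaviour concrete, here is a small stand-alone sketch of the range check with the message above. It is not the `SimpleBloomFilter` code: the class name and methods are hypothetical, and a `java.util.BitSet` stands in for the filter's storage.

```java
import java.util.BitSet;

public class IndexRangeCheckSketch {
    private final int numberOfBits;
    private final BitSet bits;

    public IndexRangeCheckSketch(int numberOfBits) {
        this.numberOfBits = numberOfBits;
        this.bits = new BitSet(numberOfBits);
    }

    // Enables the bit at idx, rejecting any value outside [0, numberOfBits).
    public void set(int idx) {
        if (idx < 0 || idx >= numberOfBits) {
            throw new IllegalArgumentException(String.format(
                "IndexProducer should only send values in the range[0,%s)", numberOfBits));
        }
        bits.set(idx);
    }

    // Number of bits currently enabled.
    public int cardinality() {
        return bits.cardinality();
    }
}
```

The upper bound is exclusive, so the message reports the number of bits itself and needs no `- 1` adjustment.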



##########
src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java:
##########
@@ -80,22 +81,23 @@
  *
  * <h3>Hasher</h3>
  *
- * <p>A Hasher converts bytes into a series of integers based on a Shape.  With the exception of the HasherCollecton,
- *  each hasher represents one item being added to the Bloom filter.  The HasherCollection represents the
- *  number of items as the sum of the number of items represented by the Hashers in the collection.</p>
+ * <p>A Hasher converts bytes into a series of integers based on a Shape. With the exception of the HasherCollecton,
+ * each hasher represents one item being added to the Bloom filter. The HasherCollection represents the number of
+ * items as the sum of the number of items represented by the Hashers in the collection.</p>
  *
- *  <p>The SimpleHasher uses a combinatorial generation technique to create the integers. It is easily
- *  initialized by using a standard {@code MessageDigest} or other Hash function to hash the item to insert and
- *  then splitting the hash bytes in half and considering each as a long value.</p>
+ * <p>The EnhancedDoubleHasher uses a combinatorial generation technique to create the integers. It is easily
+ * initialized by using a byte array returned by the standard {@code MessageDigest} or other Hash function to

Review Comment:
   `hash function` (lowercase)


