You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Claudenw (via GitHub)" <gi...@apache.org> on 2023/07/16 18:13:08 UTC

[GitHub] [commons-collections] Claudenw opened a new pull request, #406: COLLECTIONS-844 added maxValue() to CountingBloomFilter

Claudenw opened a new pull request, #406:
URL: https://github.com/apache/commons-collections/pull/406

   Counting bloom filter tests are dependant upon the maximum value that can be stored in a cell in a Bloom filter.
   
   This patch adds a `maxValue()` method to the CountingBloomFilter interface.  This is to indicate the maximum value that can be stored in a single cell of the counting filter.  A cell is the counter that replaces the bit in the counting filter.
   
   Changes to tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert merged pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert merged PR #406:
URL: https://github.com/apache/commons-collections/pull/406


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1281192120


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, {@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx ).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {
+                        return false;
+                    }
+                }
+                return true;
+            }
+
+            /**
+             * Class to track cell values in the TreeMap.
+             */
+            final class CounterCell implements Comparable<CounterCell> {
+                final int idx;
+                int count;
+
+                CounterCell(int idx, int count) {
+                    this.idx = idx;
+                    this.count = count;
+                }
+
+                @Override
+                public int compareTo(CounterCell other) {
+                    return Integer.compare( idx,  other.idx);
+                }
+            }
+        };
+    }
+
+    /**
+     * Represents an operation that accepts an {@code <index, cell>} pair.

Review Comment:
   Refers to `<index, cell>` pair. The rest of the interface is `<index, count>`.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, {@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx ).toArray();

Review Comment:
   whitespace



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, {@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx ).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {
+                        return false;
+                    }
+                }
+                return true;
+            }
+
+            /**
+             * Class to track cell values in the TreeMap.
+             */
+            final class CounterCell implements Comparable<CounterCell> {
+                final int idx;
+                int count;
+
+                CounterCell(int idx, int count) {
+                    this.idx = idx;
+                    this.count = count;
+                }
+
+                @Override
+                public int compareTo(CounterCell other) {
+                    return Integer.compare( idx,  other.idx);

Review Comment:
   Whitespace



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -32,6 +33,8 @@
 @FunctionalInterface
 public interface IndexProducer {
 
+    int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review Comment:
   This should not be in the interface. It makes it a public variable.
   
   I suggest moving the array sizing functionality to a package private class, e.g.:
   ```java
   final class IndexUtils {
       // Ensure the array can add an element at the specified index
       static int[] ensureCapacityForAdd(int[] array, int index) {
           // ...
       }
   }
   ```
   There is no other static helper class suitable, so just create an appropriately named class.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexFilter.java:
##########


Review Comment:
   All these changes are not related to this PR. Please revert.
   
   They can be considered separately. By chaining predicates using `and`, `or` and `negate` you are adding a layer of obfuscation where the method calls may make the filter less efficient. A performance benchmark would useful here.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the guarantee that no
+     * duplicate values will be returned. The indices produced are equivalent to those returned
+     * from by a Bloom filter created from this hasher.</p>
+     *
+     * @return the iterator of integers
+     * @throws IndexOutOfBoundsException if any index is less than 0
+     */
+    default IndexProducer uniqueIndices() {
+        final BitSet bitSet = new BitSet();
         forEachIndex(i -> {
-            result.set(i);
+            bitSet.set(i);
             return true;
         });
-        return result.stream().toArray();
+
+        return new IndexProducer() {
+            @Override
+            public boolean forEachIndex(IntPredicate predicate) {
+                for (int idx = bitSet.nextSetBit(0); idx >= 0; idx = bitSet.nextSetBit(idx+1)) {

Review Comment:
   `idx + 1`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the guarantee that no

Review Comment:
   Since you moved this from Hasher it refers to a non-existent method `indices(Shape)`. This paragraph also refers to `this hasher`
   



##########
src/main/java/org/apache/commons/collections4/bloomfilter/BloomFilter.java:
##########
@@ -309,4 +309,12 @@ default int estimateIntersection(final BloomFilter other) {
         }
         return estimate>Integer.MAX_VALUE?Integer.MAX_VALUE:(int) estimate;
     }
+
+    /**
+     * Most Bloom filter's create unique IndexProducers.

Review Comment:
   `filter's` to `filters`
   
   This is low quality javadoc.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.

Review Comment:
   Creates an IndexProducer that returns the same indices as this instance, removing duplicates so that each index will be output only once.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########


Review Comment:
   This file has lost the git history from `AbstractBitCountProducerTest`. It is showing as a new file and the old test as deleted.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########


Review Comment:
   This file has lost the git history from `BitCountProducer`. It is showing as a new file and the old interface as deleted.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the guarantee that no
+     * duplicate values will be returned. The indices produced are equivalent to those returned
+     * from by a Bloom filter created from this hasher.</p>
+     *
+     * @return the iterator of integers
+     * @throws IndexOutOfBoundsException if any index is less than 0

Review Comment:
   This IOOB exception is only relevant to the default implementation. As such I would remove it from the `@throws` and document the default implementation. "The default implementation will filter the indices from this instance and return them in ascending order. Any indices less than zero are invalid and will generate a runtime exception."



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected index length and value length do not match");
+        int[] idx = {0};
+        createProducer().forEachCell((i, j) -> {
+            assertEquals(expectedIdx[idx[0]], i, "bad index at "+idx[0]);

Review Comment:
   `" + idx`
   Same on next line



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int from11) {
+        BloomFilter bfFrom0 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(0, 1)));
+        assertEquals(0, bf.getMaxInsert(bfFrom0));
+        assertEquals(0, bf.getMaxInsert((BitMapProducer) bfFrom0));
+        assertEquals(0, bf.getMaxInsert((IndexProducer) bfFrom0));
+
+        assertEquals(from1, bf.getMaxInsert(TestingHashers.FROM1));
+        assertEquals(from1, bf.getMaxInsert(bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((BitMapProducer) bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((IndexProducer) bfFrom1));
+
+        assertEquals(from11, bf.getMaxInsert(TestingHashers.FROM11));
+        assertEquals(from11, bf.getMaxInsert(bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((BitMapProducer) bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((IndexProducer) bfFrom11));
+    }
+
+    @Test
+    public void testGetMaxInsert() {
+        CountingBloomFilter bf = createEmptyFilter(getTestShape());
+        verifyMaxInsert(bf, 0, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 2, 0);
+        bf.merge(TestingHashers.FROM11);
+        verifyMaxInsert(bf, 2, 1);
+        bf.remove(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 1);
+        // verify remove false positive works
+        // Incrementing hasher 5,1 spans the single count cells for both FROM1 and FROM11
+        assertEquals(1, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+        bf.remove(new IncrementingHasher(5, 1));
+        verifyMaxInsert(bf, 0, 0);
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+    }
+
+    private void assertCell3(CountingBloomFilter bf, int value) {
+        bf.forEachCell((k, v) -> {
+            if (k == 3) {
+                assertEquals(value, v, "Mismatch at position (3) "+k);
+            } else {
+                assertEquals(0, v, "Mismatch at position "+k);

Review Comment:
   `" + k`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int from11) {
+        BloomFilter bfFrom0 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(0, 1)));
+        assertEquals(0, bf.getMaxInsert(bfFrom0));
+        assertEquals(0, bf.getMaxInsert((BitMapProducer) bfFrom0));
+        assertEquals(0, bf.getMaxInsert((IndexProducer) bfFrom0));
+
+        assertEquals(from1, bf.getMaxInsert(TestingHashers.FROM1));
+        assertEquals(from1, bf.getMaxInsert(bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((BitMapProducer) bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((IndexProducer) bfFrom1));
+
+        assertEquals(from11, bf.getMaxInsert(TestingHashers.FROM11));
+        assertEquals(from11, bf.getMaxInsert(bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((BitMapProducer) bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((IndexProducer) bfFrom11));
+    }
+
+    @Test
+    public void testGetMaxInsert() {
+        CountingBloomFilter bf = createEmptyFilter(getTestShape());
+        verifyMaxInsert(bf, 0, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 2, 0);
+        bf.merge(TestingHashers.FROM11);
+        verifyMaxInsert(bf, 2, 1);
+        bf.remove(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 1);
+        // verify remove false positive works
+        // Incrementing hasher 5,1 spans the single count cells for both FROM1 and FROM11
+        assertEquals(1, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+        bf.remove(new IncrementingHasher(5, 1));
+        verifyMaxInsert(bf, 0, 0);
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+    }
+
+    private void assertCell3(CountingBloomFilter bf, int value) {
+        bf.forEachCell((k, v) -> {
+            if (k == 3) {
+                assertEquals(value, v, "Mismatch at position (3) "+k);
+            } else {
+                assertEquals(0, v, "Mismatch at position "+k);
+            }
+            return true;
+        });
+    }
+
+    @Test
+    public void mergeIncrementsAllCellsTest() {
+        CountingBloomFilter f1 = createEmptyFilter(Shape.fromKM(1, 10));
+        CountingBloomFilter f2 = f1.copy();
+        CountingBloomFilter f3 = f1.copy();
+        // index producer produces 3 two times.
+        IndexProducer ip = p -> {
+            p.test(3);
+            p.test(3);
+            return true;
+        };
+        // The merge should increment cell 3 by 1
+        f1.merge(ip);
+        assertCell3(f1, 1);
+
+        // The add should increment cells 3 by 2
+        f2.add(CellProducer.from(ip));
+        assertCell3(f2, 2);
+
+        // This merge will increment by 1 as the round-trip makes the indices unique

Review Comment:
   This test is redundant. The round trip does not make the indices unique. So the test is effectively:
   ```java
   f3.merge(ip)
   ```
   This is same as the test using f1.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/CellProducerFromDefaultIndexProducerTest.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class CellProducerFromDefaultIndexProducerTest extends AbstractCellProducerTest {
+
+    int[] data = { 0, 63, 1, 64, 128, 1, 127 };

Review Comment:
   whitespace



##########
src/test/java/org/apache/commons/collections4/bloomfilter/CellProducerFromArrayCountingBloomFilterTest.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+public class CellProducerFromArrayCountingBloomFilterTest extends AbstractCellProducerTest {
+
+    protected Shape shape = Shape.fromKM(17, 72);
+
+    @Override
+    protected CellProducer createProducer() {
+        final ArrayCountingBloomFilter filter = new ArrayCountingBloomFilter(shape);
+        filter.merge(new IncrementingHasher(0, 1));
+        filter.merge(new IncrementingHasher(5, 1));
+        return filter;
+    }
+
+    @Override
+    protected CellProducer createEmptyProducer() {
+        return new ArrayCountingBloomFilter(shape);
+    }
+
+    @Override
+    protected int[] getExpectedIndices() {
+        return new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21};
+    }
+
+    @Override
+    protected int[] getExpectedValues() {
+        return new int[] { 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1};

Review Comment:
   whitespace after {



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -119,4 +118,42 @@ public void testFromIndexArray() {
             assertArrayEquals(expected, ip.asIndexArray());
         }
     }
+
+    @Test
+    public void testMoreThan32Entries() {
+        int[] values = new int[33];
+        for (int i=0; i<33; i++) {

Review Comment:
   int[] values = IntStream.range(0, 33).toArray();



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, {@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx ).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {

Review Comment:
   whitespace



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -119,4 +118,42 @@ public void testFromIndexArray() {
             assertArrayEquals(expected, ip.asIndexArray());
         }
     }
+
+    @Test
+    public void testMoreThan32Entries() {
+        int[] values = new int[33];
+        for (int i=0; i<33; i++) {
+            values[i] = i;
+        }
+        IndexProducer producer =  predicate -> {
+            Objects.requireNonNull(predicate);
+            for (final int i : values) {
+                if (!predicate.test(i)) {
+                    return false;
+                }
+            }
+            return true;
+        };
+        int[] other = producer.asIndexArray();
+        assertArrayEquals(values, other);
+    }
+
+    @Test
+    public void test32Entries() {

Review Comment:
   You can combine this with the test above as:
   ```java
   @ParameterizedTest
   @ValueSource(ints = {32, 33})
   void testEntries(int size) {
       int[] values = IntStream.range(0, size).toArray();
   ```
   



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected index length and value length do not match");

Review Comment:
   whitespace



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -103,40 +172,41 @@ default boolean merge(final BloomFilter other) {
     /**
      * Merges the specified Hasher into this Bloom filter.
      *
-     * <p>Specifically: all counts for the unique indexes identified by the {@code hasher} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the unique indexes identified by the {@code hasher} will be incremented by 1.</p>
      *
      * <p>This method will return {@code true} if the filter is valid after the operation.</p>
      *
      * @param hasher the hasher
      * @return {@code true} if the removal was successful and the state is valid
      * @see #isValid()
-     * @see #add(BitCountProducer)
+     * @see #add(CellProducer)
      */
     @Override
     default boolean merge(final Hasher hasher) {
         Objects.requireNonNull(hasher, "hasher");
-        return merge(hasher.uniqueIndices(getShape()));
+        return merge(hasher.indices(getShape()));
     }
 
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all unique cells for the indices identified by the {@code indexProducer} will be incremented by 1.</p>
      *
      * <p>This method will return {@code true} if the filter is valid after the operation.</p>
      *
-     * <p>Note: Indices that are returned multiple times will be incremented multiple times.</p>
+     * <p>Note: If indices that are returned multiple times should be incremented multiple times convert the IndexProducer
+     * to a CellProducer and add that.</p>
      *
      * @param indexProducer the IndexProducer
      * @return {@code true} if the removal was successful and the state is valid
      * @see #isValid()
-     * @see #add(BitCountProducer)
+     * @see #add(CellProducer)
      */
     @Override
     default boolean merge(final IndexProducer indexProducer) {
         Objects.requireNonNull(indexProducer, "indexProducer");
         try {
-            return add(BitCountProducer.from(indexProducer));
+            return add(CellProducer.from(indexProducer.uniqueIndices()));
         } catch (final IndexOutOfBoundsException e) {

Review Comment:
   Note: Although not related to this PR I find it presumptuous that the code may throw an index out of bounds exception. This exception is possible from the `uniqueIndices` default implementation (see my comments on that elsewhere in this PR), and if the `indexProducer` has its own unique indices and is broken, it would be from the `add` method of the implementing counting Bloom filter. That is allowed to throw whatever runtime exception it desires. So why are we wrapping this one possible exception, and then re-throwing it as another RTE when neither are documented? This should be discussed somewhere (e.g. mailing list).



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected indices entries.

Review Comment:
   `aligns`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected index length and value length do not match");
+        int[] idx = {0};
+        createProducer().forEachCell((i, j) -> {
+            assertEquals(expectedIdx[idx[0]], i, "bad index at "+idx[0]);
+            assertEquals(expectedValue[idx[0]], j, "bad value at "+idx[0]);
+            idx[0]++;
+            return true;
+        });
+    }
+
+    /**
+     * Test the behavior of {@link CellProducer#forEachCell(CellConsumer)} with respect
+     * to ordered and distinct indices. Currently the behavior is assumed to be the same as
+     * {@link IndexProducer#forEachIndex(java.util.function.IntPredicate)}.
+     */
+    @Test
+    public final void testBehaviourForEachCell() {
+        final IntList list = new IntList();
+        createProducer().forEachCell((i, j) -> list.add(i));
+        final int[] actual = list.toArray();
+        // check order
+        final int[] expected = Arrays.stream(actual).sorted().toArray();
+        assertArrayEquals(expected, actual);
+        // check distinct
+        final long count = Arrays.stream(actual).distinct().count();
+        assertEquals(count, actual.length);
+    }
+
+    @Test
+    public void testForEachCellEarlyExit() {
+        final int[] passes = new int[1];
+        assertTrue(createEmptyProducer().forEachCell((i, j) -> {
+            passes[0]++;
+            return false;
+        }));
+        assertEquals(0, passes[0]);
+
+        assertFalse(createProducer().forEachCell((i, j) -> {

Review Comment:
   This test heavily assumes that the `createProducer` will have more than 1 cell populated so you can early exit.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {

Review Comment:
   Note: This method now avoids the BitSet. As such I think we can remove the javadoc warning that the default implementation is slow. The note about returning in unique values in order is wrong. The default implementation now returns an array using the same order and cardinality as the `forEachIndex` method.



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int from11) {

Review Comment:
   `(Counting`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultCellProducerTest.java:
##########
@@ -16,21 +16,27 @@
  */
 package org.apache.commons.collections4.bloomfilter;
 
-public class DefaultBitCountProducerTest extends AbstractBitCountProducerTest {
+public class DefaultCellProducerTest extends AbstractCellProducerTest {
 
     /** Make forEachIndex unordered and contain duplicates. */
-    private final int[] values = {10, 1, 10, 1};
+    private final int[] indices = {1, 2, 3, 5};
+    private final int[] values = {1, 4, 9, 25};
 
     @Override
     protected int[] getExpectedIndices() {
+        return indices;
+    }
+
+    @Override
+    protected int[] getExpectedValues() {
         return values;
     }
 
     @Override
-    protected BitCountProducer createProducer() {
+    protected CellProducer createProducer() {
         return consumer -> {
-            for (final int i : values) {
-                if (!consumer.test(i, 1)) {
+            for (int i=0; i<indices.length; i++) {

Review Comment:
   `for (int i = 0; i < indices.length; i++) {`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +296,114 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int from11) {
+        BloomFilter bfFrom0 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(0, 1)));
+        assertEquals(0, bf.getMaxInsert(bfFrom0));
+        assertEquals(0, bf.getMaxInsert((BitMapProducer) bfFrom0));
+        assertEquals(0, bf.getMaxInsert((IndexProducer) bfFrom0));
+
+        assertEquals(from1, bf.getMaxInsert(TestingHashers.FROM1));
+        assertEquals(from1, bf.getMaxInsert(bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((BitMapProducer) bfFrom1));
+        assertEquals(from1, bf.getMaxInsert((IndexProducer) bfFrom1));
+
+        assertEquals(from11, bf.getMaxInsert(TestingHashers.FROM11));
+        assertEquals(from11, bf.getMaxInsert(bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((BitMapProducer) bfFrom11));
+        assertEquals(from11, bf.getMaxInsert((IndexProducer) bfFrom11));
+    }
+
+    @Test
+    public void testGetMaxInsert() {
+        CountingBloomFilter bf = createEmptyFilter(getTestShape());
+        verifyMaxInsert(bf, 0, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 0);
+        bf.merge(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 2, 0);
+        bf.merge(TestingHashers.FROM11);
+        verifyMaxInsert(bf, 2, 1);
+        bf.remove(TestingHashers.FROM1);
+        verifyMaxInsert(bf, 1, 1);
+        // verify remove false positive works
+        // Incrementing hasher 5,1 spans the single count cells for both FROM1 and FROM11
+        assertEquals(1, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+        bf.remove(new IncrementingHasher(5, 1));
+        verifyMaxInsert(bf, 0, 0);
+        assertEquals(0, bf.getMaxInsert(new IncrementingHasher(5, 1)));
+    }
+
+    private void assertCell3(CountingBloomFilter bf, int value) {
+        bf.forEachCell((k, v) -> {
+            if (k == 3) {
+                assertEquals(value, v, "Mismatch at position (3) "+k);

Review Comment:
   Drop the `+k`, or drop the `(3)` as they are the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283528489


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.TreeMap;
+import java.util.function.IntPredicate;
+
+
+/**
+ * Some Bloom filter implementations use a count rather than a bit flag. The term {@code Cell} is used to
+ * refer to these counts and their associated index.  This class is the equivalent of the index producer except
+ * that it produces cells.
+ *
+ * <p>Note that a CellProducer must not return duplicate indices and must be ordered.</p>
+ *
+ * <p>Implementations must guarantee that:</p>
+ *
+ * <ul>
+ * <li>The IndexProducer implementation returns unique ordered indices.</li>
+ * <li>The cells are produced in IndexProducer order.</li>
+ * <li>For every value produced by the IndexProducer there will be only one matching
+ * cell produced by the CellProducer.</li>
+ * <li>The CellProducer will not generate cells with indices that are not output by the IndexProducer.</li>
+ * <li>The IndexProducer will not generate indices that have a zero count for the cell.</li>
+ * </ul>
+ *
+ * @since 4.5
+ */
+@FunctionalInterface
+public interface CellProducer extends IndexProducer {
+
+    /**
+     * Performs the given action for each {@code cell}  where the cell count is non-zero.
+     *
+     * <p>Some Bloom filter implementations use a count rather than a bit flag.  The term {@code Cell} is used to
+     * refer to these counts.</p>
+     *
+     * <p>Any exceptions thrown by the action are relayed to the caller. The consumer is applied to each
+     * cell. If the consumer returns {@code false} the execution is stopped, {@code false}
+     * is returned, and no further pairs are processed.</p>
+     *
+     * @param consumer the action to be performed for each non-zero cell.
+     * @return {@code true} if all cells return true from consumer, {@code false} otherwise.
+     * @throws NullPointerException if the specified consumer is null
+     */
+    boolean forEachCell(CellConsumer consumer);
+
+    /**
+     * The default implementation returns distinct and ordered indices for all cells with a non-zero count.
+     */
+    @Override
+    default boolean forEachIndex(final IntPredicate predicate) {
+        return forEachCell((i, v) -> predicate.test(i));
+    }
+
+    @Override
+    default IndexProducer uniqueIndices() {
+        return this;
+    }
+
+    /**
+     * Creates a CellProducer from an IndexProducer.
+     *
+     * <p>Note the following properties:
+     * <ul>
+     * <li>Each index returned from the IndexProducer is assumed to have a cell value of 1.</li>
+     * <li>The CellProducer aggregates duplicate indices from the IndexProducer.</li>
+     * </ul>
+     *
+     * <p>A CellProducer that outputs the mapping [(1,2),(2,3),(3,1)] can be created from many combinations
+     * of indices including:
+     * <pre>
+     * [1, 1, 2, 2, 2, 3]
+     * [1, 3, 1, 2, 2, 2]
+     * [3, 2, 1, 2, 1, 2]
+     * ...
+     * </pre>
+     *
+     * @param producer An index producer.
+     * @return A CellProducer with the same indices as the IndexProducer.
+     */
+    static CellProducer from(final IndexProducer producer) {
+        return new CellProducer() {
+            TreeMap<CounterCell, CounterCell> counterCells = new TreeMap<>();
+
+            private void populate() {
+                if (counterCells.isEmpty()) {
+                    producer.forEachIndex( idx -> {
+                        CounterCell cell = new CounterCell(idx, 1);
+                        CounterCell counter = counterCells.get(cell);
+                        if (counter == null) {
+                            counterCells.put(cell, cell);
+                        } else {
+                            counter.count++;
+                        }
+                        return true;
+                    });
+                }
+            }
+
+            @Override
+            public int[] asIndexArray() {
+                populate();
+                return counterCells.keySet().stream().mapToInt( c -> c.idx ).toArray();
+            }
+
+            @Override
+            public boolean forEachCell(CellConsumer consumer) {
+                populate();
+                for (CounterCell cell : counterCells.values()) {
+                    if (!consumer.test(cell.idx, cell.count) ) {
+                        return false;
+                    }
+                }
+                return true;
+            }
+
+            /**
+             * Class to track cell values in the TreeMap.
+             */
+            final class CounterCell implements Comparable<CounterCell> {
+                final int idx;
+                int count;
+
+                CounterCell(int idx, int count) {
+                    this.idx = idx;
+                    this.count = count;
+                }
+
+                @Override
+                public int compareTo(CounterCell other) {
+                    return Integer.compare( idx,  other.idx);
+                }
+            }
+        };
+    }
+
+    /**
+     * Represents an operation that accepts an {@code <index, cell>} pair.

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1276350791


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   I don't think your conversion to a CellProducer upholds this contract. The cells will not be incremented by 1; they will be incremented by the count of the indices because merge then calls add.
   
   IIUC a merge is from BloomFilter and should flatten all arguments to an effective bitmap, removing duplicates. The add operation should not remove duplicate indices because they are treated as cells with a count above 1.
   
   We should have some tests to check these definitions are upheld.
   
   ```Java
   ArrayCountingBloomFilter f1 = new ArrayCountingBloomFilter(Shape.fromKM(1, 10));
   ArrayCountingBloomFilter f2 = f1.copy();
   ArrayCountingBloomFilter f3 = f1.copy();
   IndexProducer ip = p -> {
       p.test(3);
       p.test(3);
       return true;
   };
   // The merge SHOULD increment cells by 1
   f1.merge(ip);
   // The add should increment cells by 2
   f2.add(CellProducer.from(ip));
   // This merge will increment by 1 as the round-trip makes the indices unique
   f3.merge(IndexProducer.fromIndexArray(ip.asIndexArray()));
   
   f1.forEachCell((k, v) -> {
       System.out.printf("%d = %d%n", k, v);
       return true;
   });
   f2.forEachCell((k, v) -> {
       System.out.printf("%d = %d%n", k, v);
       return true;
   });
   f3.forEachCell((k, v) -> {
       System.out.printf("%d = %d%n", k, v);
       return true;
   });
   ```
   Prints:
   ```
   3 = 2    // ERROR !
   3 = 2
   3 = 1
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278285658


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   If you wish to return a distinct array then the default method in the interface using a bit set will be fine.
   
   If you wish to return a count of duplicate indices then the current implementation will work if you just sort the array it currently creates. This may be faster than using a complicated intermediate data structure. The hasher is likely to have a small array and the speed of the sort will be fast.
   
   This may be an opportunity to use some more characteristics flags where an index producer can declare if the indices are sorted and or duplicates. Then the interface can provide default methods to remove duplicates or sort them.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278304303


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   The last bit of my previous message was not meant to indicate that this should not be merged, just that there are 2 more changes coming as different pull requests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1641796871

   Go ahead. If if more closely matches the literature then this is what we should do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283532699


##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########


Review Comment:
   Every action I tried left github with delete and new.  Local git showed renamed during commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1287749787


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.Arrays;
+
+/**
+ * Provides functions to assist in IndexProducer creation and manipulation.
+ * @see IndexProducer
+ */
+public class IndexUtils {
+
+    /**
+     * The maximum array size for the methods in this class.
+     */
+    public static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review Comment:
   Should not be public



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.Arrays;
+
+/**
+ * Provides functions to assist in IndexProducer creation and manipulation.
+ * @see IndexProducer
+ */
+public class IndexUtils {

Review Comment:
   Should not be public; should be final.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexFilter.java:
##########
@@ -72,7 +72,10 @@ public boolean test(final int number) {
         if (number >= size) {
             throw new IndexOutOfBoundsException(String.format("number too large %d >= %d", number, size));
         }
-        return !tracker.test(number) || consumer.test(number);
+        if (tracker.test(number)) {

Review Comment:
   You reverted the IndexFilter file but left this change. I think the statements are identical. Since the plan is to replace this with a separate change to the IndexFilter then can you revert this too. The IndexFilter should be unchanged by this PR.



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.Arrays;
+
+/**
+ * Provides functions to assist in IndexProducer creation and manipulation.
+ * @see IndexProducer
+ */
+public class IndexUtils {
+
+    /**
+     * The maximum array size for the methods in this class.
+     */
+    public static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+    // do not instantiate
+    private IndexUtils() {}
+
+    /**
+     * Ensure the array can add an element at the specified index.
+     * @param array the array to check.
+     * @param index the index to add at.
+     * @return the array or a newly allocated copy of the array.
+     */
+    static int[] ensureCapacityForAdd(int[] array, int index) {
+        if (index >= array.length) {
+            return Arrays.copyOf(array, (int) Math.min(IndexUtils.MAX_ARRAY_SIZE, Math.max( array.length * 2L, index+1)));

Review Comment:
   whitespace: `Math.max(array.length * 2L, index + 1)`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -143,14 +138,14 @@ final class CounterCell implements Comparable<CounterCell> {
 
                 @Override
                 public int compareTo(CounterCell other) {
-                    return Integer.compare( idx,  other.idx);
+                    return Integer.compare(idx,  other.idx);

Review Comment:
   No need for two spaces after the comma



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -125,9 +121,7 @@ class Indices {
             private int size;
 
             boolean add(final int index) {
-                if (size == data.length) {
-                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
-                }
+                data = IndexUtils.ensureCapacityForAdd(data, index);

Review Comment:
   This is wrong. It should use `size` when checking the capacity:
   ```java
   data = IndexUtils.ensureCapacityForAdd(data, size);
   ```
   I put this into `IndexProducerTest`:
   ```java
   @ParameterizedTest
   @ValueSource(ints = {32, 33})
   void testAsIndexArray(int n) {
       IndexProducer ip = i -> {
           for (int j = 0; j < n; j++) {
               // Always test index zero
               i.test(0);
           }
           return true;
       };
       Assertions.assertArrayEquals(new int[n], ip.asIndexArray());
   }
   ```
   The size 33 fails as the initial array of capacity 32 is not enlarged.
   



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########
@@ -115,14 +110,14 @@ private void populate() {
             @Override
             public int[] asIndexArray() {
                 populate();
-                return counterCells.keySet().stream().mapToInt( c -> c.idx ).toArray();
+                return counterCells.keySet().stream().mapToInt(c -> c.idx ).toArray();

Review Comment:
   Still trailing whitespace: `(c -> c.idx)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] codecov-commenter commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1637155296

   ## [Codecov](https://app.codecov.io/gh/apache/commons-collections/pull/406?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   > Merging [#406](https://app.codecov.io/gh/apache/commons-collections/pull/406?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (aba6993) into [master](https://app.codecov.io/gh/apache/commons-collections/commit/99473dfa205a7c476a22190ee01f0b211091987b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) (99473df) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff            @@
   ##             master     #406   +/-   ##
   =========================================
     Coverage     81.20%   81.20%           
   - Complexity     4625     4626    +1     
   =========================================
     Files           290      290           
     Lines         13483    13484    +1     
     Branches       1993     1993           
   =========================================
   + Hits          10949    10950    +1     
     Misses         1933     1933           
     Partials        601      601           
   ```
   
   
   | [Impacted Files](https://app.codecov.io/gh/apache/commons-collections/pull/406?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | |
   |---|---|---|
   | [.../collections4/bloomfilter/CountingBloomFilter.java](https://app.codecov.io/gh/apache/commons-collections/pull/406?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2NvbW1vbnMvY29sbGVjdGlvbnM0L2Jsb29tZmlsdGVyL0NvdW50aW5nQmxvb21GaWx0ZXIuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [...ections4/bloomfilter/ArrayCountingBloomFilter.java](https://app.codecov.io/gh/apache/commons-collections/pull/406?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2NvbW1vbnMvY29sbGVjdGlvbnM0L2Jsb29tZmlsdGVyL0FycmF5Q291bnRpbmdCbG9vbUZpbHRlci5qYXZh) | `100.00% <100.00%> (ø)` | |
   
   :mega: We’re building smart automated test selection to slash your CI/CD build times. [Learn more](https://about.codecov.io/iterative-testing/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1275113962


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -77,22 +77,79 @@ public interface CountingBloomFilter extends BloomFilter, BitCountProducer {
      */
     boolean isValid();
 
+    /**
+     * Returns maximum value for a cell in this Counting filter.
+     * @return the maximum value for a cell in this Counting filter.
+     */
+    int getMaxCell();
+
+    /**
+     * Determines the maximum number of times the Bloom filter could have been inserted
+     * into this counting filter.
+     * @param bloomFilter the Bloom filter the check for.
+     * @return the maximum number of times the Bloom filter could have been inserted.
+     */
+    default int getMaxInsert(BloomFilter bloomFilter) {
+        return getMaxInsert((BitMapProducer) bloomFilter);
+    }
+
+    /**
+     * Determines the maximum number of times the IndexProducer could have been inserted
+     * into this counting filter.
+     * @param idxProducer the producer to drive the count check.
+     * @return the maximum number of times the IndexProducer could have been inserted.
+     */
+    default int getMaxInsert(IndexProducer idxProducer) {
+        return getMaxInsert( BitMapProducer.fromIndexProducer(idxProducer, getShape().getNumberOfBits()));

Review Comment:
   whitespace `( BitMap`



##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java:
##########
@@ -289,6 +291,50 @@ public void testExcludesDuplicates() {
         bf1.merge(hasher);
         bf1.remove(hasher);
         assertEquals(0, bf1.cardinality());
-        assertTrue(bf1.forEachCount((x, y) -> false), "Hasher in removes results in value not equal to 0");
+        assertTrue(bf1.forEachCell((x, y) -> false), "Hasher in removes results in value not equal to 0");
+    }
+
+    private void verifyMaxInsert( CountingBloomFilter bf, int from1, int from11) {
+        BloomFilter bfFrom0 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom0.merge(new IncrementingHasher(0, 1));
+        BloomFilter bfFrom1 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom1.merge(TestingHashers.FROM1);
+        BloomFilter bfFrom11 = new DefaultBloomFilterTest.SparseDefaultBloomFilter(getTestShape());
+        bfFrom11.merge(TestingHashers.FROM11);
+
+        assertEquals( 0, bf.getMaxInsert(new IncrementingHasher(0, 1)));

Review Comment:
   No whitespace after `assertEquals(`



##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -164,17 +221,17 @@ default boolean merge(final BitMapProducer bitMapProducer) {
     /**
      * Removes the specified Bloom filter from this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code other} filter will be decremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code other} filter will be decremented by 1.</p>
      *
-     * <p>Note: If the other filter is a counting Bloom filter the index counts are ignored and it is treated as an
+     * <p>Note: If the other filter is a counting Bloom filter the othre filter's cells are ignored and it is treated as an

Review Comment:
   `othre`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1276350791


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   I don't think your conversion to a CellProducer upholds this contract. The cells will not be incremented by 1; they will be incremented by the count of the indices because merge then calls add.
   
   IIUC a merge is from BloomFilter and should flatten all arguments to an effective bitmap, removing duplicates. The add operation should not remove duplicate indices because they are treated as cells with a count above 1.
   
   We should have some tests to check these definitions are upheld.
   
   ```Java
   ArrayCountingBloomFilter f1 = new ArrayCountingBloomFilter(Shape.fromKM(1, 10));
   ArrayCountingBloomFilter f2 = f1.copy();
   ArrayCountingBloomFilter f3 = f1.copy();
   IndexProducer ip = p -> {
       p.test(3);
       p.test(3);
       return true;
   };
   // The merge SHOULD increment cells by 1
   f1.merge(ip);
   // The add should increment cells by 2
   f2.add(CellProducer.from(ip));
   // This merge will increment by 1 as the round-trip makes the indices unique
   f3.merge(IndexProducer.fromIndexArray(ip.asIndexArray()));
   
   f1.forEachCell((k, v) -> {
       System.out.printf("%d = %d%n", k, v);
       return true;
   });
   f2.forEachCell((k, v) -> {
       System.out.printf("%d = %d%n", k, v);
       return true;
   });
   f3.forEachCell((k, v) -> {
       System.out.printf("%d = %d%n", k, v);
       return true;
   });
   ```
   Prints:
   ```
   3 = 2
   3 = 2    // ERROR !
   3 = 1
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1653102952

   I fixed the issues with getMaxInsert by requiring the CellProducer return distinct ordered indices.  It made much of the code simpler, but added more test cases.  I will go back and review all the places where Cell is referenced in java doc and ensure that there is some sort of description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1641712706

   Cells are used in stable Bloom filters which are technically not a counting bloom filter they are in a class called "Element decay" Bloom filters.  They do use cells, not bits, but only report and accept bits as output and input.  There are several Bloom filters in this class.
   
   This change is to make the development of the stable and a variable size CountingBloomFilter possible.  In that change I was going to propose changing `BitCountProducer` and `BitCountConsumer` to `CellProdudcer` and `CellConsumer` respectively and change the method call from `forEachCount` to `forEachCell`.  However, I see that it may make sense to do that in this change to minimize the changes for the stable and variable counting filters.
   
   Are you ok with my proceeding to;
   
   1. Update the documentation to talk about "cells"
   2. Change BitCountProducer/BitCountConsumer to CellProducer/CellConsumer as noted above?
   3. Change `forEachCount` to `forEachCell`
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1664384896

   > Can you try and update the PR so that the git history is kept for two of the renamed files. You should be able to restore the files from an old commit, use `git mv` to rename them and then replace the contents with the new code files you have here.
   > 
   > Please remove changes to IndexFilter. They may be useful. They are not related to this change.
   
   No luck trying to get the move to be recognized.  Tried restoring and moving, restoring editing and then moving, restoring clearing the git cache of the new name, moving.  Nothing I tried resulted in the  github display showing the file as changed.  Only deleted and new.
   
   If I revert the IndexFilter change the code coverage will not be 100% and no way to get to 100%.  If that is the better solution at this point I will do that and make the change later as a separate pull request.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283581562


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283620888


##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected index length and value length do not match");
+        int[] idx = {0};
+        createProducer().forEachCell((i, j) -> {
+            assertEquals(expectedIdx[idx[0]], i, "bad index at "+idx[0]);
+            assertEquals(expectedValue[idx[0]], j, "bad value at "+idx[0]);
+            idx[0]++;
+            return true;
+        });
+    }
+
+    /**
+     * Test the behavior of {@link CellProducer#forEachCell(CellConsumer)} with respect
+     * to ordered and distinct indices. Currently the behavior is assumed to be the same as
+     * {@link IndexProducer#forEachIndex(java.util.function.IntPredicate)}.
+     */
+    @Test
+    public final void testBehaviourForEachCell() {
+        final IntList list = new IntList();
+        createProducer().forEachCell((i, j) -> list.add(i));
+        final int[] actual = list.toArray();
+        // check order
+        final int[] expected = Arrays.stream(actual).sorted().toArray();
+        assertArrayEquals(expected, actual);
+        // check distinct
+        final long count = Arrays.stream(actual).distinct().count();
+        assertEquals(count, actual.length);
+    }
+
+    @Test
+    public void testForEachCellEarlyExit() {
+        final int[] passes = new int[1];
+        assertTrue(createEmptyProducer().forEachCell((i, j) -> {
+            passes[0]++;
+            return false;
+        }));
+        assertEquals(0, passes[0]);
+
+        assertFalse(createProducer().forEachCell((i, j) -> {

Review Comment:
   My point was that the test is not checking the producer has more than 1 entry: it is assuming this. Thus it cannot ensure that the exit was early. If the producer only has 1 entry then there is no way to distinguish an early exit. But at least you can mark the test as skipped. To do this would require the test for the empty producer is separated.
   
   If we separate the test of the non-empty producer then a skipped test will be identified in the JUnit summary and should prompt an investigation as to why it was skipped (i.e. the producer only produces 1 cell).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1638300622

   The functional changes are fine, i.e. expose some measure of the capacity of the counts. But without an exact use case I am not certain of how it would be used.
   
   I think this could do with some naming improvement. You cannot `get` values from a counting Bloom filter. So `getMaxValue` seems out of context. Plus no other docs in the class refer to a 'cell'. The filter always uses the term 'bit index'.
   
   Should this property instead reflect that it is the maximum count of the same bit index that can be added to the filter before the state is invalidated. So change to `getMaxCount` and document that it is the maximum count that can be held for any bit index. If this is exceeded then the filter is invalidated (see `#isValid`).
   
   Q. Is there a reference computation for an estimate of the number of items that can be added before saturating a bit index with a certain probability. It may be useful to point a reader in the direction of some calculations. Thus they can use the Shape and this property to get an estimate of capacity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278317763


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   PS. I just saw that you changed the default implementation of asIndexArray to respect the order of forEachIndex. I noted a problem with the internal comment in that method about the capacity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278317579


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +118,64 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    // This will throw an out-of-memory error if there are too many bits.
+                    // Since bits are addressed using 32-bit signed integer indices
+                    // the maximum length should be ~2^31 / 2^6 = ~2^25.

Review Comment:
   This is wrong. You are not using the int[] as a bitmap. You are using it to store indices. So you can get out of memory. However at some point you will not be able to return the array if it needs more than the max capacity of an array. A larger limit is:
   ```java
   data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L);
   ```
   with MAX_ARRAY_SIZE something big. Previous versions of the JDK java.util.ArrayList use Integer.MAX_VALUE - 8.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283548806


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -32,6 +33,8 @@
 @FunctionalInterface
 public interface IndexProducer {
 
+    int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1661746467

   @aherbert Do you have a tool that finds all the screwy whitespace?  I try to keep the code clean but my historical style keeps slipping through.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1292804433


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.Arrays;
+
+/**
+ * Provides functions to assist in IndexProducer creation and manipulation.
+ * @see IndexProducer
+ */
+public class IndexUtils {

Review Comment:
   fixed



##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import java.util.Arrays;
+
+/**
+ * Provides functions to assist in IndexProducer creation and manipulation.
+ * @see IndexProducer
+ */
+public class IndexUtils {
+
+    /**
+     * The maximum array size for the methods in this class.
+     */
+    public static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283558387


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.

Review Comment:
   changed text



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1637156007

   Some extra checkstyle files were added. I hope to have time to review in the next few days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278292441


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   If we have IndexProducer.asArray output the indices in the order that they would be returned by IndexProducer.forEachIndex() then we are not changing the data based on the representation.
   
   This also means that sorting can be done by generating an array, using Array.sort() and then using the result to create an IndexProducer.fromIndexArray() to create the sorted index.
   
   The characteristics flag may be a good idea. We already have them in the test code.
   
   How about we move this forward as is and merge it.  I think the change is large enough (I was trying to keep is small).
   
   I still need to get the LayeredBloom filter added and it will need to be updated once we have this in place. 
   Finally, I have the StableBloom filter, which may clarify some of the questions we have a bout characteristics and whether we have chosen the proper restrictions on IndexProducer and CellProducers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1275110227


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -19,7 +19,7 @@
 import java.util.Objects;
 
 /**
- * The interface that describes a Bloom filter that associates a count with each
+ * The interface that describes a Bloom filter that associates a cell with each

Review Comment:
   Changing count to cell has led to some loss of information. The word count is obviously a tracking of occurrences. The word cell is not. IIUC a cell is a count for an index, thus it encapsulates both terms, i.e. you cannot have a cell without an index: cell = (index, count). If a cell did not have an associated index then it is simply a count.
   
   Where does this leave the naming of `CountingBloomFilter` and `ArrayCountingBloomFilter`?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1276206536


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -19,7 +19,7 @@
 import java.util.Objects;
 
 /**
- * The interface that describes a Bloom filter that associates a count with each
+ * The interface that describes a Bloom filter that associates a cell with each

Review Comment:
   Definition of cell and usage of cell vs count was reviewed and corrections made



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278265080


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   Agree this is incorrect.   I am adding this as a test to the AbstractCountingFilterTest.
   
   I think that we should make the following changes.
   
   for counting filters
   
   - merge always increments the cell value by at most 1.
   - remove always decrements the cell value by at most 1.
   - add and subtract as used to increment/decrement by more than 1.
   
   for index producers
   
   - as array always returns the indices in the number and order they are produced.
   - move the "unique" method from hashers to index producers. Hashers create index producers.  Let the use of the producer determine if it needs to be unique.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278270734


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   Is it "required" no, but when working in complex arrangements it mean that the simply converting to an array does not change the state of the data.  In most cases this is not an issue.   The only case where it is an issue is when a Hasher backed IndexProducer.asArray() is called.  In all other cases (I think) the array either already exists or exists in a different form (BitMaps, CellProducers).   BitMaps and CellProducers always return distinct values in order.
   
   My plan was to lift the code from BitMapProducer that produces bitMaps and use it to produce an array of indices from a Hasher.
   
   Perhaps my original comment is not clear.  Basically, however the underlying storage produces the indices is the order and number that will be produced in the array.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1666514599

   > I think that IndexUtils method should be moved to ArrayUtils in the root package.
   
   Note that class is package private, and copied and adapted from Commons Lang.
   
   To make the method public requires a bit more effort. For example duplicating for all 8 primitive array types and also a generic type. It may also require adding a method to expand the array for adding more than 1 item. (The current method only requires expanding the array to add 1 item.) This generic functionality may be better suited to ArrayUtils in commons-lang.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1666548559

   OK.  If there are any issues remaining please let me know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283590210


##########
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java:
##########
@@ -119,4 +118,42 @@ public void testFromIndexArray() {
             assertArrayEquals(expected, ip.asIndexArray());
         }
     }
+
+    @Test
+    public void testMoreThan32Entries() {
+        int[] values = new int[33];
+        for (int i=0; i<33; i++) {
+            values[i] = i;
+        }
+        IndexProducer producer =  predicate -> {
+            Objects.requireNonNull(predicate);
+            for (final int i : values) {
+                if (!predicate.test(i)) {
+                    return false;
+                }
+            }
+            return true;
+        };
+        int[] other = producer.asIndexArray();
+        assertArrayEquals(values, other);
+    }
+
+    @Test
+    public void test32Entries() {

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283569310


##########
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java:
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.collections4.bloomfilter;
+
+import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+
+import java.util.Arrays;
+import java.util.BitSet;
+
+import org.apache.commons.collections4.bloomfilter.CellProducer.CellConsumer;
+import org.junit.jupiter.api.Test;
+
+public abstract class AbstractCellProducerTest extends AbstractIndexProducerTest {
+
+    /**
+     * A testing CellConsumer that always returns true.
+     */
+    private static final CellConsumer TRUE_CONSUMER = (i, j) -> true;
+    /**
+     * A testing CellConsumer that always returns false.
+     */
+    private static final CellConsumer FALSE_CONSUMER = (i, j) -> false;
+
+    /**
+     * Creates an array of expected values that alignes with the expected indices entries.
+     * @return an array of expected values.
+     * @see AbstractIndexProducerTest#getExpectedIndices()
+     */
+    protected abstract int[] getExpectedValues();
+
+    @Override
+    protected final int getAsIndexArrayBehaviour() {
+        return ORDERED | DISTINCT;
+    }
+
+    /**
+     * Creates a producer with some data.
+     * @return a producer with some data
+     */
+    @Override
+    protected abstract CellProducer createProducer();
+
+    /**
+     * Creates a producer without data.
+     * @return a producer that has no data.
+     */
+    @Override
+    protected abstract CellProducer createEmptyProducer();
+
+    @Test
+    public final void testForEachCellPredicates() {
+        final CellProducer populated = createProducer();
+        final CellProducer empty = createEmptyProducer();
+
+        assertFalse(populated.forEachCell(FALSE_CONSUMER), "non-empty should be false");
+        assertTrue(empty.forEachCell(FALSE_CONSUMER), "empty should be true");
+
+        assertTrue(populated.forEachCell(TRUE_CONSUMER), "non-empty should be true");
+        assertTrue(empty.forEachCell(TRUE_CONSUMER), "empty should be true");
+    }
+
+    @Test
+    public final void testEmptyCellProducer() {
+        final CellProducer empty = createEmptyProducer();
+        final int ary[] = empty.asIndexArray();
+        assertEquals(0, ary.length);
+        assertTrue(empty.forEachCell((i, j) -> {
+            fail("forEachCell consumer should not be called");
+            return false;
+        }));
+    }
+
+    @Test
+    public final void testIndexConsistency() {
+        final CellProducer producer = createProducer();
+        final BitSet bs1 = new BitSet();
+        final BitSet bs2 = new BitSet();
+        producer.forEachIndex(i -> {
+            bs1.set(i);
+            return true;
+        });
+        producer.forEachCell((i, j) -> {
+            bs2.set(i);
+            return true;
+        });
+        assertEquals(bs1, bs2);
+    }
+
+    @Test
+    public void testForEachCellValues() {
+        int[] expectedIdx = getExpectedIndices();
+        int[] expectedValue = getExpectedValues();
+        assertEquals( expectedIdx.length, expectedValue.length, "expected index length and value length do not match");
+        int[] idx = {0};
+        createProducer().forEachCell((i, j) -> {
+            assertEquals(expectedIdx[idx[0]], i, "bad index at "+idx[0]);
+            assertEquals(expectedValue[idx[0]], j, "bad value at "+idx[0]);
+            idx[0]++;
+            return true;
+        });
+    }
+
+    /**
+     * Test the behavior of {@link CellProducer#forEachCell(CellConsumer)} with respect
+     * to ordered and distinct indices. Currently the behavior is assumed to be the same as
+     * {@link IndexProducer#forEachIndex(java.util.function.IntPredicate)}.
+     */
+    @Test
+    public final void testBehaviourForEachCell() {
+        final IntList list = new IntList();
+        createProducer().forEachCell((i, j) -> list.add(i));
+        final int[] actual = list.toArray();
+        // check order
+        final int[] expected = Arrays.stream(actual).sorted().toArray();
+        assertArrayEquals(expected, actual);
+        // check distinct
+        final long count = Arrays.stream(actual).distinct().count();
+        assertEquals(count, actual.length);
+    }
+
+    @Test
+    public void testForEachCellEarlyExit() {
+        final int[] passes = new int[1];
+        assertTrue(createEmptyProducer().forEachCell((i, j) -> {
+            passes[0]++;
+            return false;
+        }));
+        assertEquals(0, passes[0]);
+
+        assertFalse(createProducer().forEachCell((i, j) -> {

Review Comment:
   The test requires that the producer have at least 1 entry and we force an exit on the first one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283555852


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java:
##########
@@ -117,11 +120,60 @@ public boolean test(long word) {
      * @return An int array of the data.
      */
     default int[] asIndexArray() {
-        final BitSet result = new BitSet();
+        class Indices {
+            private int[] data = new int[32];
+            private int size;
+
+            boolean add(final int index) {
+                if (size == data.length) {
+                    data = Arrays.copyOf(data, (int) Math.min(MAX_ARRAY_SIZE, size * 2L));
+                }
+                data[size++] = index;
+                return true;
+            }
+
+            int[] toArray() {
+                // Edge case to avoid a large array copy
+                return size == data.length ? data : Arrays.copyOf(data, size);
+            }
+        }
+        Indices indices = new Indices();
+        forEachIndex(indices::add);
+        return indices.toArray();
+    }
+
+    /**
+     * Creates an IndexProducer of unique indices for this index.
+     *
+     * <p>This is like the `indices(Shape)` method except that it adds the guarantee that no

Review Comment:
   fixed and adde more info



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1638985470

   There are two terms here used through the API: _bit index_ which refers to a position within the shape; and the _count_ of the number of observations of that index. In that context you are stating a cell is a position that can be filled with counts. So it is a countable bit index. Would the term _cell_ ever be used for non-counting Bloom filters?
   
   Where does that leave the naming of BitCountProducer? CellCountProducer; CellProducer? This is why I questioned it. Adding new terminology should make sense to those reading the API for the first time. Perhaps it can be done by adding a note to the class javadoc for CountingBloomFilter that bit indices that can be counted are also known as cells (if this is correct).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278317045


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   I misread your previous statement "as array always returns the indices in the number and order they are produced." I took that to mean in order, i.e. sorted. Re-reading it I see that you are stating that the order matches the output produced by forEachIndex.
   
   This is fine. But it now means the default implementation of asIndexArray (which creates sorted, unique indices) is now in conflict with the javadoc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1638721687

   @aherbert 
   
   Cell is a standard term in the literature.  I am preparing two more pull requests.  
   
   1. A stable Bloom filter (weird beast) that acts sorta like a counting Bloom filter with random deletions.
   2. Another implementation of counting Bloom filter that uses the core classes from stable Bloom filter to implement counting Bloom filters that can use from 1 to 32 bits (1 is a bit ridiculous but it is supported).
   
   Both the new counting implementation and the stable Bloom filter are dependant upon a CellManager class and some supporting classes.  I could present the pull requests in either order, the first one being largish (like the layerd change) with the second being a couple of classes for the specific implementation. 
   
   I can update the current counting Bloom filter with the term "cell", but I would like to keep the terminology.  I thought I had change the method name to `getMaxCount`, I'll figure out where that change went.
   
   Would you be amenable to updating the documentation to refer to cell when we have "bits" that are values more than bits?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1649833256

   @aherbert Please take a look again and see if you are OK to merge this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1292800714


##########
src/main/java/org/apache/commons/collections4/bloomfilter/IndexFilter.java:
##########
@@ -72,7 +72,10 @@ public boolean test(final int number) {
         if (number >= size) {
             throw new IndexOutOfBoundsException(String.format("number too large %d >= %d", number, size));
         }
-        return !tracker.test(number) || consumer.test(number);
+        if (tracker.test(number)) {

Review Comment:
   This change fixes code coverage check.  In the original case there are 4 branches, one of which can not be tested.  In this case there are only 3 and the code coverage report shows 100% coverage for the Bloomfilter code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278266476


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   Agree with all points except:
   
   "as array always returns the indices in the number and order they are produced."
   
   This constraint may impact performance. Is it required?
   
   A hasher asIndexArray may not wish to do this as the indices are random. The EnhancedDoubleHasher current returns the indices unordered.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1278563011


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>

Review Comment:
   Should all be fixed and ready to go now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283566165


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java:
##########
@@ -103,40 +172,41 @@ default boolean merge(final BloomFilter other) {
     /**
      * Merges the specified Hasher into this Bloom filter.
      *
-     * <p>Specifically: all counts for the unique indexes identified by the {@code hasher} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the unique indexes identified by the {@code hasher} will be incremented by 1.</p>
      *
      * <p>This method will return {@code true} if the filter is valid after the operation.</p>
      *
      * @param hasher the hasher
      * @return {@code true} if the removal was successful and the state is valid
      * @see #isValid()
-     * @see #add(BitCountProducer)
+     * @see #add(CellProducer)
      */
     @Override
     default boolean merge(final Hasher hasher) {
         Objects.requireNonNull(hasher, "hasher");
-        return merge(hasher.uniqueIndices(getShape()));
+        return merge(hasher.indices(getShape()));
     }
 
     /**
      * Merges the specified index producer into this Bloom filter.
      *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all unique cells for the indices identified by the {@code indexProducer} will be incremented by 1.</p>
      *
      * <p>This method will return {@code true} if the filter is valid after the operation.</p>
      *
-     * <p>Note: Indices that are returned multiple times will be incremented multiple times.</p>
+     * <p>Note: If indices that are returned multiple times should be incremented multiple times convert the IndexProducer
+     * to a CellProducer and add that.</p>
      *
      * @param indexProducer the IndexProducer
      * @return {@code true} if the removal was successful and the state is valid
      * @see #isValid()
-     * @see #add(BitCountProducer)
+     * @see #add(CellProducer)
      */
     @Override
     default boolean merge(final IndexProducer indexProducer) {
         Objects.requireNonNull(indexProducer, "indexProducer");
         try {
-            return add(BitCountProducer.from(indexProducer));
+            return add(CellProducer.from(indexProducer.uniqueIndices()));
         } catch (final IndexOutOfBoundsException e) {

Review Comment:
   We do this because BloomFilter throws and IndexOutOfBoundsException if any index is <0 or >=bitCount.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on a diff in pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on code in PR #406:
URL: https://github.com/apache/commons-collections/pull/406#discussion_r1283571907


##########
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java:
##########


Review Comment:
   Seems to be an issue with github display local git showed that it was renamed during commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] aherbert commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "aherbert (via GitHub)" <gi...@apache.org>.
aherbert commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1664496353

   I think you have fixed it. The GH UI sometimes does not show the renamed file as a move. I have cloned the latest PR code and verified locally that the history is there:
   ```
   git log --follow  src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
   git log --follow  src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java
   ```
   Log revisions date back to 2022.
   
   I should have done this before I asked you to restore the history as it may have been there already. Anyway, it is OK now.
   
   I think we should revert the IndexFilter change and do that separately. This PR is related to the counting Bloom filters and their ability to indicate the max capacity of a cell and estimate the number of entries of an item.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [commons-collections] Claudenw commented on pull request #406: COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE

Posted by "Claudenw (via GitHub)" <gi...@apache.org>.
Claudenw commented on PR #406:
URL: https://github.com/apache/commons-collections/pull/406#issuecomment-1666492356

   I think that IndexUtils method should be moved to ArrayUtils in the root package.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@commons.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org