You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/05/05 20:58:11 UTC

[GitHub] [lucene] rmuir commented on a change in pull request #127: LUCENE-9946: Support multi-value fields in range facet counting

rmuir commented on a change in pull request #127:
URL: https://github.com/apache/lucene/pull/127#discussion_r626892485



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/range/LongRangeCounter.java
##########
@@ -17,26 +17,153 @@
 package org.apache.lucene.facet.range;
 
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 
 /**
- * Counts how many times each range was seen; per-hit it's just a binary search ({@link #add})
- * against the elementary intervals, and in the end we rollup back to the original ranges.
+ * Segment tree for counting numeric ranges. Works for both single- and multi-valued cases (assuming
+ * you use it correctly).
+ *
+ * <p>Usage notes: For counting against a single value field/source, callers should call add() for
+ * each value and then call finish() after all documents have been processed. The call to finish()
+ * will inform the caller how many documents didn't match against any ranges. After finish() has
+ * been called, the caller-provided count buffer (passed into the ctor) will be populated with
+ * accurate range counts.
+ *
+ * <p>For counting against a multi-valued field, callers should call startDoc() at the beginning of
+ * processing each doc, followed by add() for each value, and then endDoc() at the end of the doc.
+ * The call to endDoc() will inform the caller if that document matched against any ranges. It is
+ * not necessary to call finish() for multi-valued cases. After each call to endDoc(), the
+ * caller-provided count buffer (passed into the ctor) will be populated with accurate range counts.
  */
 final class LongRangeCounter {
 
-  final LongRangeNode root;
-  final long[] boundaries;
-  final int[] leafCounts;
+  /** segment tree root node */
+  private final LongRangeNode root;
+  /** elementary segment boundaries used for efficient counting (bsearch to find interval) */
+  private final long[] boundaries;
+  /** accumulated counts for all of the ranges */
+  private final int[] countBuffer;
+  /** whether-or-not we're counting docs that could be multi-valued */
+  final boolean isMultiValued;
+
+  // Needed only for counting single-valued docs:
+  /** counts seen in each elementary interval leaf */
+  private final int[] singleValuedLeafCounts;
+
+  // Needed only for counting multi-valued docs:
+  /** whether-or-not an elementary interval has seen at least one match for a single doc */
+  private final boolean[] multiValuedDocLeafHits;
+  /** whether-or-not a range has seen at least one match for a single doc */
+  private final boolean[] multiValuedDocRangeHits;

Review comment:
       rather than `boolean[]`, have you experimented with FixedBitSet? I'm suspicious that `boolean[]` may take 8x as much memory since there isn't really a java boolean type at the .class level. personally I can't think of where i've ever used a `boolean[]`, i just usually think bitset in my head, so maybe my fears are unfounded...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org