You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/24 19:42:44 UTC

[GitHub] [lucene] gsmiller commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

gsmiller commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r715860923



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {

Review comment:
       I'd suggest looking through some of the other implementations of `Facets` since there are a few other ones that might also benefit from this work. Anything that utilizes the option of sparse counting into a map would be a good candidate. To that end, I would suggest splitting this out into its own class to be reused. Ideally it would be pkg-private, but I think the sub-packaging under the facets package might prevent us from doing this :(

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -27,14 +29,16 @@
 import org.apache.lucene.facet.FacetsConfig.DimConfig;
 import org.apache.lucene.facet.LabelAndValue;
 import org.apache.lucene.facet.TopOrdAndIntQueue;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.FixedBitSet;
 
 /** Base class for all taxonomy-based facets that aggregate to a per-ords int[]. */
 public abstract class IntTaxonomyFacets extends TaxonomyFacets {
 
   /** Per-ordinal value. */
   private final int[] values;
 
-  private final IntIntHashMap sparseValues;
+  private final IntIntHashMapWithFixedBitSet sparseValues;

Review comment:
       It might be worth experimenting with a heuristic-based approach to determine when it's worth using this new approach vs. just using the map directly. We'd ideally want to only use this new approach when it's likely that a large number of the unique ordinals/categories seen during counting are only seen once. If most of the ordinals we see when counting are seen more than once, this just adds unnecessary overhead. I wonder if there's a heuristic that would make sense for approximating the likelihood of a long-tail of single-hit ordinals in counting? Maybe some experimentation on the wikipedia dataset through `luceneutil` would help?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org