You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/16 23:44:52 UTC

[GitHub] [lucene] mdmarshmallow opened a new pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

mdmarshmallow opened a new pull request #304:
URL: https://github.com/apache/lucene/pull/304


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   This change is a possible improvement to how we do facet counting. Today, we accumulate counts in an HPPC int/int map or an `int[]` array. However, it is possible many of these counts are singletons, which means we could use a `FixedBitSet` to keep track of the bits for singleton counts, and if those counts increase to be greater than 1, we can accumulate counts like we do today. This could be more space efficient as we could be saving a lot of entries in the int/int map.
   
   # Solution
   
   I created a new `private class IntIntHashMapWithFixedBitSet` in `IntTaxonomyFacets` class. This class keeps all the necessary operations that `IntIntHashMap` has but now wrapped with a `FixedBitSet`.
   
   # Tests
   
   I added randomized unit tests to test the `IntIntHashMapWithFixedBitSet` class. I tested this class by comparing its behavior to the `IntIntHashMap` class since they should behave the exact same.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r715860923



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {

Review comment:
       I'd suggest looking through some of the other implementations of `Facets` since there are a few other ones that might also benefit from this work. Anything that utilizes the option of sparse counting into a map would be a good candidate. To that end, I would suggest splitting this out into its own class to be reused. Ideally it would be pkg-private, but I think the sub-packaging under the facets package might prevent us from doing this :(

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -27,14 +29,16 @@
 import org.apache.lucene.facet.FacetsConfig.DimConfig;
 import org.apache.lucene.facet.LabelAndValue;
 import org.apache.lucene.facet.TopOrdAndIntQueue;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.FixedBitSet;
 
 /** Base class for all taxonomy-based facets that aggregate to a per-ords int[]. */
 public abstract class IntTaxonomyFacets extends TaxonomyFacets {
 
   /** Per-ordinal value. */
   private final int[] values;
 
-  private final IntIntHashMap sparseValues;
+  private final IntIntHashMapWithFixedBitSet sparseValues;

Review comment:
       It might be worth experimenting with a heuristic-based approach to determine when it's worth using this new approach vs. just using the map directly. We'd ideally want to only use this new approach when it's likely that a large number of the unique ordinals/categories seen during counting are only seen once. If most of the ordinals we see when counting are seen more than once, this just adds unnecessary overhead. I wonder if there's a heuristic that would make sense for approximating the likelihood of a long-tail of single-hit ordinals in counting? Maybe some experimentation on the wikipedia dataset through `luceneutil` would help?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r718007188



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -43,7 +47,7 @@ protected IntTaxonomyFacets(
     super(indexFieldName, taxoReader, config);
 
     if (useHashTable(fc, taxoReader)) {
-      sparseValues = new IntIntHashMap();
+      sparseValues = new IntIntHashMapWithFixedBitSet(taxoReader.getSize());

Review comment:
       I think we can definitely experiment with the conditions under which this logic is triggered. For example, I can imagine cases where there are a lot of matches (say > 10% of the index, which triggers the code to use the `values` array) and which has a lot of sparse ordinals i.e, ordinals with a count of just 0.
   
   In that case we could benefit from an optimization that creates a large bitset with most values set to 0 instead of a large `values[int]` array with most values set to 0.
   
    Another idea might be to remove both these `hashtable` and `values[]` array ideas and keep just the bitset backed by a hashtable idea (this PR). 
   
   Maybe you could have a `wikimedium10m` run with a small print statement that records whether the class used a sparse or a dense data-structure (and then get the counts).  Based on those results we could then decide on where to add this logic

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {

Review comment:
       +1 there is a similar class called `FloatTaxonomyFacets` but it does not currently use this sparse counting mechanism. If we see some performance gains from this approach, we could definitively extend it to that class as well. 

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -43,7 +47,7 @@ protected IntTaxonomyFacets(
     super(indexFieldName, taxoReader, config);
 
     if (useHashTable(fc, taxoReader)) {
-      sparseValues = new IntIntHashMap();
+      sparseValues = new IntIntHashMapWithFixedBitSet(taxoReader.getSize());

Review comment:
       I think we can definitely experiment with the conditions under which this logic is triggered. For example, I can imagine cases where there are a lot of matches (say > 10% of the index, which triggers the code to use the `values` array) and which has a lot of sparse ordinals i.e, ordinals with a count of just 0.
   
   In that case we could benefit from an optimization that creates a large bitset with most values set to 0 instead of a large `values[int]` array with most values set to 0.
   
    Another idea might be to remove both these `hashtable` and `values[]` array ideas and keep just the bitset backed by a hashtable idea (this PR). 
   
   Maybe you could have a `wikimedium10m` run with a small print statement that records whether the class used a sparse or a dense data-structure (and then get the counts from the stdout file).  Based on those results we could then decide on where to add this logic

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -43,7 +47,7 @@ protected IntTaxonomyFacets(
     super(indexFieldName, taxoReader, config);
 
     if (useHashTable(fc, taxoReader)) {
-      sparseValues = new IntIntHashMap();
+      sparseValues = new IntIntHashMapWithFixedBitSet(taxoReader.getSize());

Review comment:
       I think we can definitely experiment with the conditions under which this logic is triggered. For example, I can imagine cases where there are a lot of matches (say > 10% of the index, which triggers the code to use the `values` array) and which has a lot of sparse ordinals i.e, ordinals with a count of just 0.
   
   In that case we could benefit from an optimization that creates a large bitset with most values set to 0 instead of a large `values[int]` array with most values set to 0.
   
    Another idea might be to remove both these `hashtable` and `values[]` array ideas and keep just the bitset backed by a hashtable idea (this PR). 
   
   Maybe you could have a `wikimedium10m` run with a small print statement that records whether the class used a sparse or a dense data-structure (and then get the counts from the stdout file).  Based on those results we could then decide on where to add this logic. As a side effect, this experiment also verifies whether your new code is being used or no :)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r718921856



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {

Review comment:
       I checked and there are several classes that this change could apply to, with varying amounts of additional work:
   * `IntTaxonomyFacets`
   * `StringValueFacetCounts`
   * `FloatTaxonomyFacets`: Currently this only uses a dense array to count facets
   * `LongValueTaxonomyFacets`: Would need to create a class like `LongIntHashMapWithFixedBitSet` to work with this
   * `SortedSetDocValuesCount`: Same as `FloatTaxonomyFacets`, it only uses dense arrays to count facets
   * `ConcurrentSortedSetDocValueFacetCounts`: Currently uses a dense atomic array, will need to make a threadsafe version of `IntIntHashMapWithFixedBitSet` to use




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r718921856



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {

Review comment:
       I checked and there are several classes that this change could apply to, with varying amounts of additional work:
   * `IntTaxonomyFacets`
   * `StringValueFacetCounts`
   * `FloatTaxonomyFacets`: Currently this only uses a dense array to count facets
   * `LongValueTaxonomyFacets`: Would need to create a class like `LongIntHashMapWithFixedBitSet` to work with this
   * `SortedSetDocValuesCount`: Same as `FloatTaxonomyFacets`, it only uses dense arrays to count facets
   * `ConcurrentSortedSetDocValueFacetCounts`: Currently uses a dense atomic array, will need to make a threadsafe version of `IntIntHashMapWithFixedBitSet` to use




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r718007188



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -43,7 +47,7 @@ protected IntTaxonomyFacets(
     super(indexFieldName, taxoReader, config);
 
     if (useHashTable(fc, taxoReader)) {
-      sparseValues = new IntIntHashMap();
+      sparseValues = new IntIntHashMapWithFixedBitSet(taxoReader.getSize());

Review comment:
       I think we can definitely experiment with the conditions under which this logic is triggered. For example, I can imagine cases where there are a lot of matches (say > 10% of the index, which triggers the code to use the `values` array) and which has a lot of sparse ordinals i.e, ordinals with a count of just 0.
   
   In that case we could benefit from an optimization that creates a large bitset with most values set to 0 instead of a large `values[int]` array with most values set to 0.
   
    Another idea might be to remove both these `hashtable` and `values[]` array ideas and keep just the bitset backed by a hashtable idea (this PR). 
   
   Maybe you could have a `wikimedium10m` run with a small print statement that records whether the class used a sparse or a dense data-structure (and then get the counts from the stdout file).  Based on those results we could then decide on where to add this logic




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r718007188



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -43,7 +47,7 @@ protected IntTaxonomyFacets(
     super(indexFieldName, taxoReader, config);
 
     if (useHashTable(fc, taxoReader)) {
-      sparseValues = new IntIntHashMap();
+      sparseValues = new IntIntHashMapWithFixedBitSet(taxoReader.getSize());

Review comment:
       I think we can definitely experiment with the conditions under which this logic is triggered. For example, I can imagine cases where there are a lot of matches (say > 10% of the index, which triggers the code to use the `values` array) and which has a lot of sparse ordinals i.e, ordinals with a count of just 0.
   
   In that case we could benefit from an optimization that creates a large bitset with most values set to 0 instead of a large `values[int]` array with most values set to 0.
   
    Another idea might be to remove both these `hashtable` and `values[]` array ideas and keep just the bitset backed by a hashtable idea (this PR). 
   
   Maybe you could have a `wikimedium10m` run with a small print statement that records whether the class used a sparse or a dense data-structure (and then get the counts).  Based on those results we could then decide on where to add this logic

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,92 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  protected static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {

Review comment:
       +1 there is a similar class called `FloatTaxonomyFacets` but it does not currently use this sparse counting mechanism. If we see some performance gains from this approach, we could definitively extend it to that class as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow closed pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow closed pull request #304:
URL: https://github.com/apache/lucene/pull/304


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gautamworah96 commented on a change in pull request #304: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #304:
URL: https://github.com/apache/lucene/pull/304#discussion_r718007188



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -43,7 +47,7 @@ protected IntTaxonomyFacets(
     super(indexFieldName, taxoReader, config);
 
     if (useHashTable(fc, taxoReader)) {
-      sparseValues = new IntIntHashMap();
+      sparseValues = new IntIntHashMapWithFixedBitSet(taxoReader.getSize());

Review comment:
       I think we can definitely experiment with the conditions under which this logic is triggered. For example, I can imagine cases where there are a lot of matches (say > 10% of the index, which triggers the code to use the `values` array) and which has a lot of sparse ordinals i.e, ordinals with a count of just 0.
   
   In that case we could benefit from an optimization that creates a large bitset with most values set to 0 instead of a large `values[int]` array with most values set to 0.
   
    Another idea might be to remove both these `hashtable` and `values[]` array ideas and keep just the bitset backed by a hashtable idea (this PR). 
   
   Maybe you could have a `wikimedium10m` run with a small print statement that records whether the class used a sparse or a dense data-structure (and then get the counts from the stdout file).  Based on those results we could then decide on where to add this logic. As a side effect, this experiment also verifies whether your new code is being used or no :)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org