You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/08 21:42:55 UTC

[GitHub] [lucene] mdmarshmallow opened a new pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

mdmarshmallow opened a new pull request #288:
URL: https://github.com/apache/lucene/pull/288


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   This change is a possible improvement to how we do facet counting. Today, we accumulate counts in an HPPC int/int map or an `int[]` array. However, it is possible many of these counts are singletons, which means we could use a `FixedBitSet` to keep track of the bits for singleton counts, and if those counts increase to be greater than 1, we can accumulate counts like we do today. This could be more space efficient as we could be saving a lot of entries in the int/int map.
   
   # Solution
   
   I created a new `private class IntIntHashMapWithFixedBitSet` in `IntTaxonomyFacets` class. This class keeps all the necessary operations that `IntIntHashMap` has but now wrapped with a `FixedBitSet`.
   
   # Tests
   
   Since this change should not change or add to the behavior of the `IntTaxonomyFacets` class, I did not add any unit tests as existing ones should catch any issues with `IntTaxonomyFacets`.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #288:
URL: https://github.com/apache/lucene/pull/288#discussion_r706319030



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {
+        return 1;
+      }
+      int currentValue = intIntHashMap.addTo(key, incrementValue);
+      if (currentValue == 1) {
+        intIntHashMap.remove(key);

Review comment:
       I'm pretty sure it must always be `> 0` -- maybe add an `assert`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow closed pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow closed pull request #288:
URL: https://github.com/apache/lucene/pull/288


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on a change in pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on a change in pull request #288:
URL: https://github.com/apache/lucene/pull/288#discussion_r706366414



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {
+        return 1;
+      }
+      int currentValue = intIntHashMap.addTo(key, incrementValue);
+      if (currentValue == 1) {
+        intIntHashMap.remove(key);

Review comment:
       I posted a new commit which included that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on a change in pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on a change in pull request #288:
URL: https://github.com/apache/lucene/pull/288#discussion_r705639626



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {
+        return 1;
+      }
+      int currentValue = intIntHashMap.addTo(key, incrementValue);
+      if (currentValue == 1) {
+        intIntHashMap.remove(key);

Review comment:
       Ok, my mistake, I wasn't sure if `incrementValue` was always `> 0`. In that case, I'll remove this and add an `assert` here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mdmarshmallow commented on a change in pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mdmarshmallow commented on a change in pull request #288:
URL: https://github.com/apache/lucene/pull/288#discussion_r705638799



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {

Review comment:
       Will change that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on a change in pull request #288: LUCENE 10080 Added FixedBitSet for one counts when counting taxonomy facet labels

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #288:
URL: https://github.com/apache/lucene/pull/288#discussion_r705520199



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {
+        return 1;
+      }
+      int currentValue = intIntHashMap.addTo(key, incrementValue);
+      if (currentValue == 1) {
+        intIntHashMap.remove(key);

Review comment:
       Eeeek, how would this happen?  I think `incrementValue` must always be `> 0`?  If it is `1` we short-circuit in the above `if`.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {

Review comment:
       Can you use `fixedBitSet.getAndSet(key) == false` instead, to reduce chance of future refactoring bugs?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/IntTaxonomyFacets.java
##########
@@ -253,4 +257,78 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
 
     return new FacetResult(dim, path, totValue, labelValues, childCount);
   }
+
+  /**
+   * Class that uses FixedBitSet to store counts for all ordinals with 1 count and IntIntHashMap for
+   * all other counts
+   */
+  private static class IntIntHashMapWithFixedBitSet implements Iterable<IntIntCursor> {
+    // if the key exists, fixedBitSet[key] will be true, if fixedBitSet[key] is true but the key in
+    // intIntHashMap
+    // does not exist, then the value is 1
+    private final FixedBitSet fixedBitSet;
+    private final IntIntHashMap intIntHashMap;
+
+    IntIntHashMapWithFixedBitSet(int numCategories) {
+      fixedBitSet = new FixedBitSet(numCategories);
+      intIntHashMap = new IntIntHashMap();
+    }
+
+    public int addTo(int key, int incrementValue) {
+      if (!fixedBitSet.getAndSet(key) && incrementValue == 1) {
+        return 1;
+      }
+      int currentValue = intIntHashMap.addTo(key, incrementValue);
+      if (currentValue == 1) {
+        intIntHashMap.remove(key);
+      }
+      return currentValue;
+    }
+
+    public int get(int key) {
+      if (fixedBitSet.get(key)) {
+        return intIntHashMap.getOrDefault(key, 1);

Review comment:
       Maybe it would be better if the bitset was used only for the "precisely == 1" case?  Then we could avoid checking hash map if we see the bit is set.  Hmm, although then iteration is hairy since you'd have to do a merge sort of the bits and the hash map keys .. nevermind!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org