You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/04 03:55:47 UTC

[GitHub] [lucene] goankur opened a new pull request #282: Lucene-10070

goankur opened a new pull request #282:
URL: https://github.com/apache/lucene/pull/282


   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
   
   * https://issues.apache.org/jira/projects/LUCENE
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   
   LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   Following `Facets` implementations are changed to ignore deleted documents for `count all` queries that don't use `FacetsCollector` instance.
     - SortedSetDocValueFacetCounts
     - ConcurrentSortedSetDocValueFacetCounts
     - LongValueFacetCounts
     - StringValueFacetCounts 
   
   # Solution
   
   Ignore deleted documents during docValues iteration to calculate facet label counts
   
   # Tests
   
   Following test methods have been copied from pull request
    https://github.com/apache/lucene/pull/263/files (Thanks Greg Miller)
     - TestLongValueFacetCounts.testCountAll() 
     - TestStringValueFacetCount.testCountAll()
     - TestSortedSetDocValuesFacets.testCountAll()
     
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request title.
   - [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `main` branch.
   - [X] I have run `./gradlew check`.
   - [X] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a change in pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
gsmiller commented on a change in pull request #282:
URL: https://github.com/apache/lucene/pull/282#discussion_r702274019



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/ConcurrentSortedSetDocValuesFacetCounts.java
##########
@@ -159,13 +160,15 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc
     final MatchingDocs hits;
     final OrdinalMap ordinalMap;
     final int segOrd;
+    final Bits liveDocs;
 
     public CountOneSegment(
         LeafReader leafReader, MatchingDocs hits, OrdinalMap ordinalMap, int segOrd) {
       this.leafReader = leafReader;
       this.hits = hits;
       this.ordinalMap = ordinalMap;
       this.segOrd = segOrd;
+      this.liveDocs = (leafReader != null) ? leafReader.getLiveDocs() : null;

Review comment:
       I wonder if it makes sense to just `assert leafReader != null` at the top of this ctor? If it's `null`, when `call()` gets invoked it will throw an NPE anyway (down in `DocValues.getSortedSet`), so might be nice to fail early on this. What do you think?

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -152,7 +153,8 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc
   }
 
   private void countOneSegment(
-      OrdinalMap ordinalMap, LeafReader reader, int segOrd, MatchingDocs hits) throws IOException {
+      OrdinalMap ordinalMap, LeafReader reader, int segOrd, MatchingDocs hits, Bits liveDocs)

Review comment:
       I left a similar comment in the concurrent version of this, but I think we should specialize the count-all case so we're not checking `liveDocs` in the common case where `hits` is non-null (since `hits` will already do that check essentially, and should only be providing live docs).

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/ConcurrentSortedSetDocValuesFacetCounts.java
##########
@@ -207,11 +210,17 @@ public Void call() throws IOException {
           // Remap every ord to global ord as we iterate:
           if (singleValues != null) {
             for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
+              if (liveDocs != null && liveDocs.get(doc) == false) {

Review comment:
       Hmm, this is a little tricky. We only need to check `liveDocs` if `hits` is `null` (i.e., we're counting "all" docs). If `hits` is non-null, then we know it's only providing "live" docs already, so we don't need to re-check. Other `Facets` implementations actually provide a separate implementation / code-path for the "count all" case to specialize, and I wonder if we should do the same here. The code will end up being fairly duplicated (like elsewhere), but it would be more efficient for the common case where `hits` is non-null.

##########
File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##########
@@ -272,7 +273,8 @@ private void count(FacetsCollector facetsCollector) throws IOException {
       // Assuming the state is valid, ordinalMap should be null since we have one segment:
       assert ordinalMap == null;
 
-      countOneSegment(docValues, hits.context.ord, hits);
+      // hits contain live documents only, no need to pass live docs bitset explicitly
+      countOneSegment(docValues, hits.context.ord, hits, null);

Review comment:
       Similar comment to what I left in `ConcurrentSortedSetDocValueFacetCounts` about whether-or-not it makes sense to split out a separate implementation for the count-all case. I think we should even though there might be significant code duplication. Checking liveness isn't necessary when `hits` is non-null, and will just add overhead to the common case. Thoughts?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #282:
URL: https://github.com/apache/lucene/pull/282#discussion_r706535145



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/ConcurrentSortedSetDocValuesFacetCounts.java
##########
@@ -207,11 +210,17 @@ public Void call() throws IOException {
           // Remap every ord to global ord as we iterate:
           if (singleValues != null) {
             for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
+              if (liveDocs != null && liveDocs.get(doc) == false) {

Review comment:
       Checking `liveDocs`  only when `hits` is `null` makes sense to me.  However, duplicating significant amount of code doesn't sound appealing to me. To get around this issue,  I created a new DocIdSetIterator that wraps liveDocs and the main DocIdSetIterator only when we know we are counting all docs (i.e hits is null).  Please see the new PR
   https://github.com/apache/lucene/pull/293/files




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #282:
URL: https://github.com/apache/lucene/pull/282#discussion_r706532667



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/ConcurrentSortedSetDocValuesFacetCounts.java
##########
@@ -159,13 +160,15 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc
     final MatchingDocs hits;
     final OrdinalMap ordinalMap;
     final int segOrd;
+    final Bits liveDocs;
 
     public CountOneSegment(
         LeafReader leafReader, MatchingDocs hits, OrdinalMap ordinalMap, int segOrd) {
       this.leafReader = leafReader;
       this.hits = hits;
       this.ordinalMap = ordinalMap;
       this.segOrd = segOrd;
+      this.liveDocs = (leafReader != null) ? leafReader.getLiveDocs() : null;

Review comment:
       Yes that makes sense. I made the change in the new revision captured in a different PR
   https://github.com/apache/lucene/pull/293
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #282:
URL: https://github.com/apache/lucene/pull/282#discussion_r706532667



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/ConcurrentSortedSetDocValuesFacetCounts.java
##########
@@ -159,13 +160,15 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc
     final MatchingDocs hits;
     final OrdinalMap ordinalMap;
     final int segOrd;
+    final Bits liveDocs;
 
     public CountOneSegment(
         LeafReader leafReader, MatchingDocs hits, OrdinalMap ordinalMap, int segOrd) {
       this.leafReader = leafReader;
       this.hits = hits;
       this.ordinalMap = ordinalMap;
       this.segOrd = segOrd;
+      this.liveDocs = (leafReader != null) ? leafReader.getLiveDocs() : null;

Review comment:
       Yes that makes sense. I made the change in the new revision captured in a different PR
   https://github.com/apache/lucene/pull/293
   
   I had to create the new PR of a different branch as changes in a local branch got overwritten after a rebase from remote, likely due to some GIT related mistakes I made. 
   
   I'd request you to move our conversation over to the new PR. Apologies for the inconvenience.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #282:
URL: https://github.com/apache/lucene/pull/282#discussion_r706535368



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
##########
@@ -272,7 +273,8 @@ private void count(FacetsCollector facetsCollector) throws IOException {
       // Assuming the state is valid, ordinalMap should be null since we have one segment:
       assert ordinalMap == null;
 
-      countOneSegment(docValues, hits.context.ord, hits);
+      // hits contain live documents only, no need to pass live docs bitset explicitly
+      countOneSegment(docValues, hits.context.ord, hits, null);

Review comment:
       Please see my earlier comments. Thanks for taking a look.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
gsmiller commented on pull request #282:
URL: https://github.com/apache/lucene/pull/282#issuecomment-921968655


   @goankur this can be closed out now right since you opened a separate PR for this change? There's nothing else remaining on this one is there? Thanks again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] goankur commented on a change in pull request #282: Lucene-10070

Posted by GitBox <gi...@apache.org>.
goankur commented on a change in pull request #282:
URL: https://github.com/apache/lucene/pull/282#discussion_r706535312



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java
##########
@@ -152,7 +153,8 @@ private FacetResult getDim(String dim, OrdRange ordRange, int topN) throws IOExc
   }
 
   private void countOneSegment(
-      OrdinalMap ordinalMap, LeafReader reader, int segOrd, MatchingDocs hits) throws IOException {
+      OrdinalMap ordinalMap, LeafReader reader, int segOrd, MatchingDocs hits, Bits liveDocs)

Review comment:
       Please see https://github.com/apache/lucene/pull/293/files for the approach that relies on wrapping `liveDocs` and main `DocIdSeIterator` to avoid checking `liveDocs` when `hits` is not null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org