You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/08/15 21:24:22 UTC

[GitHub] [lucene] jmazanec15 opened a new pull request, #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

jmazanec15 opened a new pull request, #1068:
URL: https://github.com/apache/lucene/pull/1068

   ### Description (or a Jira issue link if you have one)
   
   Update all subiterators of BitSetConjunctionDISI to NO_MORE_DOCS when lead iterator's current doc is greater than the length of the smallest bitset in the bitset iterators. Add test to verify sub iterators exhaust when this case happens.
   
   [LUCENE-10674](https://issues.apache.org/jira/browse/LUCENE-10674)
   
   
   <!--
   If this is your first contribution to Lucene, please make sure you have reviewed the contribution guide.
   https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jmazanec15 commented on a diff in pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jmazanec15 commented on code in PR #1068:
URL: https://github.com/apache/lucene/pull/1068#discussion_r966532126


##########
lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java:
##########
@@ -281,6 +281,12 @@ private int doNext(int doc) throws IOException {
       advanceLead:
       for (; ; doc = lead.nextDoc()) {
         if (doc >= minLength) {
+          if (doc != NO_MORE_DOCS) {
+            lead.advance(NO_MORE_DOCS);
+          }
+          for (BitSetIterator iterator : bitSetIterators) {
+            iterator.setDocId(NO_MORE_DOCS);
+          }

Review Comment:
   Oh I see. For this, we created a DocIdSetIterator using DocIdSetBuilder. For [maxDoc](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java#L116), it was less than the number of docs in the index. This led to the length of the bitsets being different.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz merged pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jpountz merged PR #1068:
URL: https://github.com/apache/lucene/pull/1068


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #1068:
URL: https://github.com/apache/lucene/pull/1068#discussion_r966192242


##########
lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java:
##########
@@ -281,6 +281,12 @@ private int doNext(int doc) throws IOException {
       advanceLead:
       for (; ; doc = lead.nextDoc()) {
         if (doc >= minLength) {
+          if (doc != NO_MORE_DOCS) {
+            lead.advance(NO_MORE_DOCS);
+          }
+          for (BitSetIterator iterator : bitSetIterators) {
+            iterator.setDocId(NO_MORE_DOCS);
+          }

Review Comment:
   The `if` statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not `maxDoc`, do you know where this happens?
   
   The `for` loop should be unnecessary, there is no guarantee that all sub iterators advance to `NO_MORE_DOCS`. If this causes problems, then it means we have another bug somewhere else?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1240949394

   The fact that it should be legal for `ConjunctionDISI` to return `NO_MORE_DOCS` when the lead iterator advances to `NO_MORE_DOCS` without advancing other iterators makes me wonder if this change is hiding another bug. Do you understand why it's important for clauses under the `ConjuctionDISI` to advance to `NO_MORE_DOCS`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jmazanec15 commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jmazanec15 commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1241094561

   > The fact that it should be legal for ConjunctionDISI to return NO_MORE_DOCS when the lead iterator advances to NO_MORE_DOCS without advancing other iterators makes me wonder if this change is hiding another bug. 
   
   Is it valid for an iterator to advance outside of the ConjunctionDISI? I cant find anywhere that prevents this, but I was under the assumption it should be invalid to do this.
   
   > Do you understand why it's important for clauses under the ConjuctionDISI to advance to NO_MORE_DOCS?
   
   In general I was trying to keep all sub-iterators on the same doc all the time. In terms of importance, I guess it depends if the iterators can be used outside the ConjunctionDISI. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jmazanec15 commented on a diff in pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jmazanec15 commented on code in PR #1068:
URL: https://github.com/apache/lucene/pull/1068#discussion_r966298807


##########
lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java:
##########
@@ -281,6 +281,12 @@ private int doNext(int doc) throws IOException {
       advanceLead:
       for (; ; doc = lead.nextDoc()) {
         if (doc >= minLength) {
+          if (doc != NO_MORE_DOCS) {
+            lead.advance(NO_MORE_DOCS);
+          }
+          for (BitSetIterator iterator : bitSetIterators) {
+            iterator.setDocId(NO_MORE_DOCS);
+          }

Review Comment:
   > The if statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not maxDoc, do you know where this happens?
   
   I think I might be misunderstanding the question. Each bitsetiterator could have a different length of bitset, potentially as an optimization ([minLength](https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L256) I think suggests this is expected). If a bitsetiterator's top match is 10 and there are 1M docs in the index, I think there was no reason to store 1M bits - the bitsetiterator can just exhaust after 10.
   
   
   > The for loop should be unnecessary, there is no guarantee that all sub iterators advance to NO_MORE_DOCS. If this causes problems, then it means we have another bug somewhere else?
   
   Agree this is probably unnecessary. I added it to ensure that [this statement](https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L31-L32) holds: "Requires that all of its sub-iterators must be on the same document all the time."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1241631941

   Thank you for your comments, I think I understand the bug now. I think a better description of the bug is that `BitSetConjunctionDISI#docID()` doesn't honor its contract that it must return `NO_MORE_DOCS` when the iterator is exhausted. And this issue may only occur if any of the bitsets involved in the conjunction have a length that is less than `maxDoc`, which is not typical. This would explain why we haven't seen this bug earlier.
   
   > Is it valid for an iterator to advance outside of the ConjunctionDISI? I cant find anywhere that prevents this, but I was under the assumption it should be invalid to do this.
   
   It is invalid indeed. The goal of the assertion that all iterators are on the same doc is to catch such issues.
   
   > I was trying to keep all sub-iterators on the same doc all the time.
   
   I have a preference for not advancing other iterators, because it should not be necessary, and if it is then this means we have another bug somewhere else. The `ConjunctionDISI` that doesn't optimize for bitsets only advances other iterators when reaching `NO_MORE_DOCS` because avoiding to do this would require introducing more conditionals, which would have overhead. But it isn't necessary.
   
   I would suggest to no longer advance other iterators, and change the test to make sure that `docID()` returns `NO_MORE_DOCS` when the iterator is exhausted, instead of checking if all sub iterators are on the same doc ID?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jmazanec15 commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jmazanec15 commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1244076921

   @jpountz That makes sense. If the iterators are not valid outside of the ConjDISI, it doesnt make sense to advance them once the leader is exhausted. I updated test and changes to reflect this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jmazanec15 commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
jmazanec15 commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1240159528

   @zhaih Thanks! Added entry for 9.5 release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1249481566

   @jmazanec15 Maybe one thing I'd like to double check: what is the query that created a DocIdSetBuilder with a length that is less than `maxDoc`? Is it a Lucene Query or a custom query of your own?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1249477900

   @jmazanec15 FYI I backported this change to 9.4, so it should be in the next release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

Posted by GitBox <gi...@apache.org>.
zhaih commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1238647657

   Thanks, looks reasonable to me, please add an entry to CHANGES.txt!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org