Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/09/23 22:48:15 UTC

[GitHub] [lucene] gsmiller commented on a change in pull request #319: LUCENE-10121: More skipping in WANDScorer.

gsmiller commented on a change in pull request #319:
URL: https://github.com/apache/lucene/pull/319#discussion_r715177324



##########
File path: lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
##########
@@ -385,42 +387,69 @@ private void advanceTail() throws IOException {
   }
 
   private void updateMaxScores(int target) throws IOException {
-    if (head.size() == 0) {
-      // If the head is empty we use the greatest score contributor as a lead
-      // like for conjunctions.
-      upTo = tail[0].scorer.advanceShallow(target);
-    } else {
-      // If we still have entries in 'head', we treat them all as leads and
-      // take the minimum of their next block boundaries as a next boundary.
-      // We don't take entries in 'tail' into account on purpose: 'tail' is
-      // supposed to contain the least score contributors, and taking them
-      // into account might not move the boundary fast enough, so we'll waste
-      // CPU re-computing the next boundary all the time.
-      int newUpTo = DocIdSetIterator.NO_MORE_DOCS;
-      for (DisiWrapper w : head) {
-        if (w.doc <= newUpTo) {
-          newUpTo = Math.min(w.scorer.advanceShallow(w.doc), newUpTo);
-          w.maxScore = scaleMaxScore(w.scorer.getMaxScore(newUpTo), scalingFactor);
+    while (upTo < DocIdSetIterator.NO_MORE_DOCS) {
+      if (head.size() == 0) {
+        // If the head is empty we use the greatest score contributor as a lead
+        // like for conjunctions.
+        upTo = tail[0].scorer.advanceShallow(target);
+      } else {
+        // If we still have entries in 'head', we treat them all as leads and
+        // take the minimum of their next block boundaries as a next boundary.
+        // We don't take entries in 'tail' into account on purpose: 'tail' is
+        // supposed to contain the least score contributors, and taking them
+        // into account might not move the boundary fast enough, so we'll waste
+        // CPU re-computing the next boundary all the time.
+        int newUpTo = DocIdSetIterator.NO_MORE_DOCS;
+        for (DisiWrapper w : head) {
+          if (w.doc <= newUpTo) {
+            newUpTo = Math.min(w.scorer.advanceShallow(w.doc), newUpTo);
+            w.maxScore = scaleMaxScore(w.scorer.getMaxScore(newUpTo), scalingFactor);
+          }
         }
+        upTo = newUpTo;
       }
-      upTo = newUpTo;
-    }
 
-    tailMaxScore = 0;
-    for (int i = 0; i < tailSize; ++i) {
-      DisiWrapper w = tail[i];
-      w.scorer.advanceShallow(target);
-      w.maxScore = scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
-      upHeapMaxScore(tail, i); // the heap might need to be reordered
-      tailMaxScore += w.maxScore;
-    }
+      // Because the scaling process involves rounding, it is possible that a
+      // block is considered as possibly containing a competitive document
+      // according to the scaled scores, while the unscaled scores could tell us
+      // that the block has no competitive documents. We try to detect such
+      // situations here.
+      while (target < DocIdSetIterator.NO_MORE_DOCS && getMaxScore(upTo) < unscaledMinCompetitiveScore) {
+        target = upTo + 1;
+        if (target < 0) { // overflow
+          target = DocIdSetIterator.NO_MORE_DOCS;

Review comment:
       Can this actually overflow here since the while-loop is checking `target < DocIdSetIterator.NO_MORE_DOCS` already?
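
       For context, `DocIdSetIterator.NO_MORE_DOCS` is `Integer.MAX_VALUE`, so `upTo + 1` wraps to a negative value whenever `upTo` has already reached the sentinel. If I read the loop right, that can happen even though the condition checks `target`, because `upTo = advanceShallow(target)` at the bottom of the loop body may return the sentinel before the next `target = upTo + 1`. A minimal sketch of the wrap-around (standalone, not Lucene code):

       ```java
       public class OverflowSketch {
         // Mirrors Lucene's sentinel: DocIdSetIterator.NO_MORE_DOCS == Integer.MAX_VALUE.
         static final int NO_MORE_DOCS = Integer.MAX_VALUE;

         public static void main(String[] args) {
           int upTo = NO_MORE_DOCS;        // e.g. advanceShallow exhausted the iterator
           int target = upTo + 1;          // int overflow: wraps to Integer.MIN_VALUE
           System.out.println(target < 0); // true: the overflow guard is reachable
           if (target < 0) {               // the guard from the patch
             target = NO_MORE_DOCS;
           }
           System.out.println(target == NO_MORE_DOCS); // true
         }
       }
       ```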

##########
File path: lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
##########
@@ -385,42 +387,69 @@ private void advanceTail() throws IOException {
   }
 
   private void updateMaxScores(int target) throws IOException {
-    if (head.size() == 0) {
-      // If the head is empty we use the greatest score contributor as a lead
-      // like for conjunctions.
-      upTo = tail[0].scorer.advanceShallow(target);
-    } else {
-      // If we still have entries in 'head', we treat them all as leads and
-      // take the minimum of their next block boundaries as a next boundary.
-      // We don't take entries in 'tail' into account on purpose: 'tail' is
-      // supposed to contain the least score contributors, and taking them
-      // into account might not move the boundary fast enough, so we'll waste
-      // CPU re-computing the next boundary all the time.
-      int newUpTo = DocIdSetIterator.NO_MORE_DOCS;
-      for (DisiWrapper w : head) {
-        if (w.doc <= newUpTo) {
-          newUpTo = Math.min(w.scorer.advanceShallow(w.doc), newUpTo);
-          w.maxScore = scaleMaxScore(w.scorer.getMaxScore(newUpTo), scalingFactor);
+    while (upTo < DocIdSetIterator.NO_MORE_DOCS) {
+      if (head.size() == 0) {
+        // If the head is empty we use the greatest score contributor as a lead
+        // like for conjunctions.
+        upTo = tail[0].scorer.advanceShallow(target);
+      } else {
+        // If we still have entries in 'head', we treat them all as leads and
+        // take the minimum of their next block boundaries as a next boundary.
+        // We don't take entries in 'tail' into account on purpose: 'tail' is
+        // supposed to contain the least score contributors, and taking them
+        // into account might not move the boundary fast enough, so we'll waste
+        // CPU re-computing the next boundary all the time.
+        int newUpTo = DocIdSetIterator.NO_MORE_DOCS;
+        for (DisiWrapper w : head) {
+          if (w.doc <= newUpTo) {
+            newUpTo = Math.min(w.scorer.advanceShallow(w.doc), newUpTo);
+            w.maxScore = scaleMaxScore(w.scorer.getMaxScore(newUpTo), scalingFactor);
+          }
         }
+        upTo = newUpTo;
       }
-      upTo = newUpTo;
-    }
 
-    tailMaxScore = 0;
-    for (int i = 0; i < tailSize; ++i) {
-      DisiWrapper w = tail[i];
-      w.scorer.advanceShallow(target);
-      w.maxScore = scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
-      upHeapMaxScore(tail, i); // the heap might need to be reordered
-      tailMaxScore += w.maxScore;
-    }
+      // Because the scaling process involves rounding, it is possible that a
+      // block is considered as possibly containing a competitive document
+      // according to the scaled scores, while the unscaled scores could tell us
+      // that the block has no competitive documents. We try to detect such
+      // situations here.
+      while (target < DocIdSetIterator.NO_MORE_DOCS && getMaxScore(upTo) < unscaledMinCompetitiveScore) {
+        target = upTo + 1;
+        if (target < 0) { // overflow
+          target = DocIdSetIterator.NO_MORE_DOCS;
+        }
+        upTo = advanceShallow(target);
+      }
 
-    // We need to make sure that entries in 'tail' alone cannot match
-    // a competitive hit.
-    while (tailSize > 0 && tailMaxScore >= minCompetitiveScore) {
-      DisiWrapper w = popTail();
-      w.doc = w.iterator.advance(target);
-      head.add(w);
+      tailMaxScore = 0;
+      for (int i = 0; i < tailSize; ++i) {
+        DisiWrapper w = tail[i];
+        w.scorer.advanceShallow(target);
+        w.maxScore = scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+        upHeapMaxScore(tail, i); // the heap might need to be reordered
+        tailMaxScore += w.maxScore;
+      }
+
+      // We need to make sure that entries in 'tail' alone cannot match
+      // a competitive hit.
+      while (tailSize > 0 && tailMaxScore >= minCompetitiveScore) {
+        DisiWrapper w = popTail();
+        w.doc = w.iterator.advance(target);
+        head.add(w);
+      }
+
+      // Make sure the head is now beyond the target
+      DisiWrapper top = head.top();
+      if (top != null) {

Review comment:
       Maybe check `if (head.size() != 0)` instead to be consistent with the checks earlier in this method?
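
       For a priority queue the two checks should be interchangeable, since `top()` returns null exactly when the queue is empty; the suggestion is purely stylistic, matching the `head.size() == 0` branch earlier in `updateMaxScores`. A quick sketch using `java.util.PriorityQueue` as a stand-in for Lucene's `DisiPriorityQueue` (hypothetical simplification):

       ```java
       import java.util.PriorityQueue;

       public class HeadCheckSketch {
         public static void main(String[] args) {
           // Stand-in for the 'head' DisiPriorityQueue; peek() plays the role of top().
           PriorityQueue<Integer> head = new PriorityQueue<>();
           // The two emptiness checks agree on an empty heap...
           System.out.println((head.peek() == null) == (head.size() == 0)); // true
           head.add(42);
           // ...and on a non-empty one, so either form is correct; size() != 0
           // simply mirrors the style used earlier in the method.
           System.out.println((head.peek() == null) == (head.size() == 0)); // true
         }
       }
       ```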



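The rounding situation that the patch's new comment describes ("Because the scaling process involves rounding...") can be illustrated with a simplified scaling scheme. This is a hypothetical stand-in, not Lucene's actual `scaleMaxScore`/`scaleMinScore`; it only assumes that max scores are rounded up (to stay upper bounds) and the minimum competitive score is rounded down:

```java
public class ScalingSketch {
  // Hypothetical: scale a float score into the long domain, rounding up
  // so a scaled block maximum remains a valid upper bound.
  static long scaleMax(float score, int scalingFactor) {
    return (long) Math.ceil(Math.scalb((double) score, scalingFactor));
  }

  // Hypothetical: the minimum competitive score is rounded down instead,
  // so no competitive hit is ever pruned by the scaling.
  static long scaleMin(float score, int scalingFactor) {
    return (long) Math.floor(Math.scalb((double) score, scalingFactor));
  }

  public static void main(String[] args) {
    int scalingFactor = 10;
    float blockMax = 1.0004f;       // unscaled max score of a block
    float minCompetitive = 1.0006f; // unscaled minimum competitive score
    // By the scaled scores, the block looks potentially competitive...
    System.out.println(scaleMax(blockMax, scalingFactor)
        >= scaleMin(minCompetitive, scalingFactor)); // true
    // ...while the unscaled scores show it cannot contain a competitive hit,
    // which is exactly the case the patch's inner while-loop skips past.
    System.out.println(blockMax < minCompetitive); // true
  }
}
```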

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


