You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/22 16:45:43 UTC
[GitHub] [lucene] jpountz commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

jpountz commented on a change in pull request #221:
URL: https://github.com/apache/lucene/pull/221#discussion_r674980048



##########
File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java
##########
@@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) {
     public void reset() throws IOException {
       // caller may not have fully consumed the queue:
       queue.clear();
+      current = null;
       boolean first = true;
       for (T sub : subs) {
         if (first) {
           // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue
           // so the first call to next() will advance it
           sub.mappedDocID = -1;
+          current = sub;
           first = false;
-        } else {
-          int mappedDocID;
-          while (true) {
-            int docID = sub.nextDoc();
-            if (docID == NO_MORE_DOCS) {
-              mappedDocID = NO_MORE_DOCS;
-              break;
-            }
-            mappedDocID = sub.docMap.get(docID);
-            if (mappedDocID != -1) {
-              break;
-            }
-          }
-          if (mappedDocID == NO_MORE_DOCS) {
-            // all docs in this sub were deleted; do not add it to the queue!
-            continue;
-          }
-          sub.mappedDocID = mappedDocID;
-        }
-        queue.add(sub);
+        } else if (sub.nextMappedDoc() != NO_MORE_DOCS) {
+          queue.add(sub);
+        } // else all docs in this sub were deleted; do not add it to the queue!
       }
+      queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;
     }
 
     @Override
     public T next() throws IOException {
-      T top = queue.top();
-
-      while (true) {
-        int docID = top.nextDoc();
-        if (docID == NO_MORE_DOCS) {
-          queue.pop();
-          top = queue.top();
-          break;
-        }
-        int mappedDocID = top.docMap.get(docID);
-        if (mappedDocID == -1) {
-          // doc was deleted
-          continue;
+      int nextDoc = current.nextMappedDoc();
+      // the below condition is unlikely when the index sort is on a low-cardinality field
+      if (nextDoc >= queueNextMappedDoc) {
+        if (nextDoc == NO_MORE_DOCS) {
+          if (queue.size() == 0) {
+            current = null;
+            return null;
+          } else {
+            current = queue.pop();
+            queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;

Review comment:
       Thanks. I had to add a check whether the queue is empty under `if (nextDoc == NO_MORE_DOCS) {` for it to work.
   
   I ran a quick benchmark with a merge that merges 10M docs that have 10 single-valued SORTED_SET doc-value fields each. Without index sorting, the merge runs in ~740ms. With index sorting on a NumericDocValues field of cardinality 8, I get the following merging times:
    - master: ~1300ms
    - the first version of this PR: ~860ms,
    - your suggestion: ~910ms
   
   I think the extra simplicity is worth the minor slowdown, it's still significantly faster than master.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org