You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/21 17:27:15 UTC

[GitHub] [lucene] jpountz opened a new pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

jpountz opened a new pull request #221:
URL: https://github.com/apache/lucene/pull/221


   When sorting by low-cardinality fields, the same sub remains current for long
   sequences of doc IDs. This speeds up SortedDocIdMerger a bit by caching the next
   mapped document of the next sub so that we can quickly check whether the current
   sub is actually still current.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dnhatn commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

Posted by GitBox <gi...@apache.org>.
dnhatn commented on a change in pull request #221:
URL: https://github.com/apache/lucene/pull/221#discussion_r674914345



##########
File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java
##########
@@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) {
     public void reset() throws IOException {
       // caller may not have fully consumed the queue:
       queue.clear();
+      current = null;
       boolean first = true;
       for (T sub : subs) {
         if (first) {
           // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue
           // so the first call to next() will advance it
           sub.mappedDocID = -1;
+          current = sub;
           first = false;
-        } else {
-          int mappedDocID;
-          while (true) {
-            int docID = sub.nextDoc();
-            if (docID == NO_MORE_DOCS) {
-              mappedDocID = NO_MORE_DOCS;
-              break;
-            }
-            mappedDocID = sub.docMap.get(docID);
-            if (mappedDocID != -1) {
-              break;
-            }
-          }
-          if (mappedDocID == NO_MORE_DOCS) {
-            // all docs in this sub were deleted; do not add it to the queue!
-            continue;
-          }
-          sub.mappedDocID = mappedDocID;
-        }
-        queue.add(sub);
+        } else if (sub.nextMappedDoc() != NO_MORE_DOCS) {
+          queue.add(sub);
+        } // else all docs in this sub were deleted; do not add it to the queue!
       }
+      queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;
     }
 
     @Override
     public T next() throws IOException {
-      T top = queue.top();
-
-      while (true) {
-        int docID = top.nextDoc();
-        if (docID == NO_MORE_DOCS) {
-          queue.pop();
-          top = queue.top();
-          break;
-        }
-        int mappedDocID = top.docMap.get(docID);
-        if (mappedDocID == -1) {
-          // doc was deleted
-          continue;
+      int nextDoc = current.nextMappedDoc();
+      // the below condition is unlikely when the index sort is on a low-cardinality field
+      if (nextDoc >= queueNextMappedDoc) {
+        if (nextDoc == NO_MORE_DOCS) {
+          if (queue.size() == 0) {
+            current = null;
+            return null;
+          } else {
+            current = queue.pop();
+            queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;

Review comment:
       I think we can remove `queueNextMappedDoc` and replace it with `queue.top().mappedDocID`? Below is my version without `queueNextMappedDoc`. WDYT?
   
   ```java
   int nextDoc = current.nextMappedDoc();
   // the below condition is unlikely when the index sort is on a low-cardinality field
   if (nextDoc == NO_MORE_DOCS) {
     current = queue.pop();
   } else if (queue.size() > 0 && nextDoc > queue.top().mappedDocID) {
     T newCurrent = queue.top();
     queue.updateTop(current);
     current = newCurrent;
   }
   return current;
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz merged pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

Posted by GitBox <gi...@apache.org>.
jpountz merged pull request #221:
URL: https://github.com/apache/lucene/pull/221


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #221:
URL: https://github.com/apache/lucene/pull/221#discussion_r674980048



##########
File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java
##########
@@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) {
     public void reset() throws IOException {
       // caller may not have fully consumed the queue:
       queue.clear();
+      current = null;
       boolean first = true;
       for (T sub : subs) {
         if (first) {
           // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue
           // so the first call to next() will advance it
           sub.mappedDocID = -1;
+          current = sub;
           first = false;
-        } else {
-          int mappedDocID;
-          while (true) {
-            int docID = sub.nextDoc();
-            if (docID == NO_MORE_DOCS) {
-              mappedDocID = NO_MORE_DOCS;
-              break;
-            }
-            mappedDocID = sub.docMap.get(docID);
-            if (mappedDocID != -1) {
-              break;
-            }
-          }
-          if (mappedDocID == NO_MORE_DOCS) {
-            // all docs in this sub were deleted; do not add it to the queue!
-            continue;
-          }
-          sub.mappedDocID = mappedDocID;
-        }
-        queue.add(sub);
+        } else if (sub.nextMappedDoc() != NO_MORE_DOCS) {
+          queue.add(sub);
+        } // else all docs in this sub were deleted; do not add it to the queue!
       }
+      queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;
     }
 
     @Override
     public T next() throws IOException {
-      T top = queue.top();
-
-      while (true) {
-        int docID = top.nextDoc();
-        if (docID == NO_MORE_DOCS) {
-          queue.pop();
-          top = queue.top();
-          break;
-        }
-        int mappedDocID = top.docMap.get(docID);
-        if (mappedDocID == -1) {
-          // doc was deleted
-          continue;
+      int nextDoc = current.nextMappedDoc();
+      // the below condition is unlikely when the index sort is on a low-cardinality field
+      if (nextDoc >= queueNextMappedDoc) {
+        if (nextDoc == NO_MORE_DOCS) {
+          if (queue.size() == 0) {
+            current = null;
+            return null;
+          } else {
+            current = queue.pop();
+            queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;

Review comment:
       Thanks. I had to add a check whether the queue is empty under `if (nextDoc == NO_MORE_DOCS) {` for it to work.
   
   I ran a quick benchmark with a merge that merges 10M docs that have 10 single-valued SORTED_SET doc-value fields each. Without index sorting, the merge runs in ~740ms. With index sorting on a NumericDocValues field of cardinality 8, I get the following merging times:
    - master: ~1300ms
    - the first version of this PR: ~860ms,
    - your suggestion: ~910ms
   
   I think the extra simplicity is worth the minor slowdown, it's still significantly faster than master.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dnhatn commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

Posted by GitBox <gi...@apache.org>.
dnhatn commented on a change in pull request #221:
URL: https://github.com/apache/lucene/pull/221#discussion_r674984362



##########
File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java
##########
@@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) {
     public void reset() throws IOException {
       // caller may not have fully consumed the queue:
       queue.clear();
+      current = null;
       boolean first = true;
       for (T sub : subs) {
         if (first) {
           // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue
           // so the first call to next() will advance it
           sub.mappedDocID = -1;
+          current = sub;
           first = false;
-        } else {
-          int mappedDocID;
-          while (true) {
-            int docID = sub.nextDoc();
-            if (docID == NO_MORE_DOCS) {
-              mappedDocID = NO_MORE_DOCS;
-              break;
-            }
-            mappedDocID = sub.docMap.get(docID);
-            if (mappedDocID != -1) {
-              break;
-            }
-          }
-          if (mappedDocID == NO_MORE_DOCS) {
-            // all docs in this sub were deleted; do not add it to the queue!
-            continue;
-          }
-          sub.mappedDocID = mappedDocID;
-        }
-        queue.add(sub);
+        } else if (sub.nextMappedDoc() != NO_MORE_DOCS) {
+          queue.add(sub);
+        } // else all docs in this sub were deleted; do not add it to the queue!
       }
+      queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;
     }
 
     @Override
     public T next() throws IOException {
-      T top = queue.top();
-
-      while (true) {
-        int docID = top.nextDoc();
-        if (docID == NO_MORE_DOCS) {
-          queue.pop();
-          top = queue.top();
-          break;
-        }
-        int mappedDocID = top.docMap.get(docID);
-        if (mappedDocID == -1) {
-          // doc was deleted
-          continue;
+      int nextDoc = current.nextMappedDoc();
+      // the below condition is unlikely when the index sort is on a low-cardinality field
+      if (nextDoc >= queueNextMappedDoc) {
+        if (nextDoc == NO_MORE_DOCS) {
+          if (queue.size() == 0) {
+            current = null;
+            return null;
+          } else {
+            current = queue.pop();
+            queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;

Review comment:
       @jpountz Thanks for benchmarking this. I am fine with either versions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dnhatn commented on a change in pull request #221: LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields.

Posted by GitBox <gi...@apache.org>.
dnhatn commented on a change in pull request #221:
URL: https://github.com/apache/lucene/pull/221#discussion_r674914345



##########
File path: lucene/core/src/java/org/apache/lucene/index/DocIDMerger.java
##########
@@ -144,59 +159,45 @@ protected boolean lessThan(Sub a, Sub b) {
     public void reset() throws IOException {
       // caller may not have fully consumed the queue:
       queue.clear();
+      current = null;
       boolean first = true;
       for (T sub : subs) {
         if (first) {
           // by setting mappedDocID = -1, this entry is guaranteed to be the top of the queue
           // so the first call to next() will advance it
           sub.mappedDocID = -1;
+          current = sub;
           first = false;
-        } else {
-          int mappedDocID;
-          while (true) {
-            int docID = sub.nextDoc();
-            if (docID == NO_MORE_DOCS) {
-              mappedDocID = NO_MORE_DOCS;
-              break;
-            }
-            mappedDocID = sub.docMap.get(docID);
-            if (mappedDocID != -1) {
-              break;
-            }
-          }
-          if (mappedDocID == NO_MORE_DOCS) {
-            // all docs in this sub were deleted; do not add it to the queue!
-            continue;
-          }
-          sub.mappedDocID = mappedDocID;
-        }
-        queue.add(sub);
+        } else if (sub.nextMappedDoc() != NO_MORE_DOCS) {
+          queue.add(sub);
+        } // else all docs in this sub were deleted; do not add it to the queue!
       }
+      queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;
     }
 
     @Override
     public T next() throws IOException {
-      T top = queue.top();
-
-      while (true) {
-        int docID = top.nextDoc();
-        if (docID == NO_MORE_DOCS) {
-          queue.pop();
-          top = queue.top();
-          break;
-        }
-        int mappedDocID = top.docMap.get(docID);
-        if (mappedDocID == -1) {
-          // doc was deleted
-          continue;
+      int nextDoc = current.nextMappedDoc();
+      // the below condition is unlikely when the index sort is on a low-cardinality field
+      if (nextDoc >= queueNextMappedDoc) {
+        if (nextDoc == NO_MORE_DOCS) {
+          if (queue.size() == 0) {
+            current = null;
+            return null;
+          } else {
+            current = queue.pop();
+            queueNextMappedDoc = queue.size() == 0 ? NO_MORE_DOCS : queue.top().mappedDocID;

Review comment:
       I think we can remove `queueNextMappedDoc` and replace it with `queue.top().mappedDocID`? Below is my version without `queueNextMappedDoc`. WDYT?
   
   ```java
   int nextDoc = current.nextMappedDoc();
   // the below condition is unlikely when the index sort is on a low-cardinality field
   if (nextDoc == NO_MORE_DOCS) {
     if (queue.size() == 0) {
       return null;
     }
     current = queue.pop();
   } else if (queue.size() > 0 && nextDoc > queue.top().mappedDocID) {
     T newCurrent = queue.top();
     queue.updateTop(current);
     current = newCurrent;
   }
   return current;
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org