Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/05 19:55:07 UTC

[GitHub] [lucene] jtibshirani commented on a change in pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

jtibshirani commented on a change in pull request #204:
URL: https://github.com/apache/lucene/pull/204#discussion_r664107815



##########
File path: lucene/core/src/test/org/apache/lucene/search/TestSortOptimization.java
##########
@@ -332,6 +333,71 @@ public void testFloatSortOptimization() throws IOException {
     dir.close();
   }
 
+  /**
+   * Test that a search with sort on [_doc, other fields] across multiple indices doesn't miss any
+   * documents.
+   */
+  public void testDocSortOptimizationMultipleIndices() throws IOException {
+    final int numIndices = 3;
+    final int numDocsInIndex = atLeast(50);
+    Directory[] dirs = new Directory[numIndices];
+    IndexReader[] readers = new IndexReader[numIndices];
+    for (int i = 0; i < numIndices; i++) {
+      dirs[i] = newDirectory();
+      final int remainder = i % 3;

Review comment:
       Since `numIndices` is 3, `i` is always less than 3 and `i % 3` is just `i`. Do we need to take a mod here?

##########
File path: lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java
##########
@@ -81,7 +87,12 @@ public Integer value(int slot) {
     public DocLeafComparator(LeafReaderContext context) {
       this.docBase = context.docBase;
       if (enableSkipping) {
-        this.minDoc = topValue + 1;
+        // For a single sort on _doc, we want to skip all docs before topValue.
+        // For multiple fields sort on [_doc, other fields], we want to include docs with the same
+        // docID.
+        // This is needed in a distributed search, where there are docs from different indices with
+        // the same docID.
+        this.minDoc = singleSort ? topValue + 1 : topValue;

Review comment:
       This seems to work and matches the approach in `NumericComparator`. I guess it doesn't specifically address the case where `_doc` is the last sort, for example a sort on `["some_field", "_doc"]`, where we could also use `topValue + 1`. 
   
   One thing I wondered: is keeping track of `singleSort` really important, or could we simplify and just always use `topValue`? At most we'd consider one extra document. A similar simplification would apply to `NumericComparator`. The skipping logic is a bit complex and I'm thinking about the performance/simplicity trade-off.
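
To make the trade-off concrete, here is a minimal sketch of the two choices for `minDoc`. It is not the Lucene implementation: `MinDocSketch`, `minDoc`, and `isCompetitive` are hypothetical names, and the only assumption carried over from the diff is that docs below `minDoc` are skipped.

```java
// Illustrative sketch only; these names are hypothetical and not part of Lucene.
final class MinDocSketch {

  // Mirrors the expression in the diff: a single sort on _doc can skip the
  // top document itself, a multi-field sort must still visit it.
  static int minDoc(boolean singleSort, int topValue) {
    return singleSort ? topValue + 1 : topValue;
  }

  // Assumption for this sketch: a doc is visited only if its ID is >= minDoc.
  static boolean isCompetitive(int doc, int minDoc) {
    return doc >= minDoc;
  }

  public static void main(String[] args) {
    int topValue = 5; // docID of the last hit from the previous page

    // Single sort on _doc: doc 5 is skipped, doc 6 is visited.
    System.out.println(isCompetitive(5, minDoc(true, topValue)));  // false
    System.out.println(isCompetitive(6, minDoc(true, topValue)));  // true

    // Sort on [_doc, other fields]: doc 5 is still visited, so a document
    // from another index that shares docID 5 can be compared on the next field.
    System.out.println(isCompetitive(5, minDoc(false, topValue))); // true
  }
}
```

Always using `topValue` would correspond to calling `minDoc(false, topValue)` in every case, which visits at most one extra document (the one whose ID equals `topValue`) compared with the single-sort branch.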

##########
File path: lucene/core/src/test/org/apache/lucene/search/TestSortOptimization.java
##########
@@ -332,6 +333,71 @@ public void testFloatSortOptimization() throws IOException {
     dir.close();
   }
 
+  /**
+   * Test that a search with sort on [_doc, other fields] across multiple indices doesn't miss any
+   * documents.
+   */
+  public void testDocSortOptimizationMultipleIndices() throws IOException {
+    final int numIndices = 3;
+    final int numDocsInIndex = atLeast(50);
+    Directory[] dirs = new Directory[numIndices];
+    IndexReader[] readers = new IndexReader[numIndices];
+    for (int i = 0; i < numIndices; i++) {
+      dirs[i] = newDirectory();
+      final int remainder = i % 3;
+      Function<Integer, Integer> valueSupplier = docID -> (docID * 3 + remainder);

Review comment:
       Maybe this could be a simple variable assignment instead of using a supplier. Also I think we can replace 3 with `numIndices`.
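
A rough sketch of what that simplification might look like, outside of the test harness. `ValueAssignmentSketch`, `valueFor`, and `allValues` are hypothetical names used only for illustration; the formula is the one from the diff with 3 replaced by `numIndices` and `remainder` replaced by the index ordinal `i`.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; these names are hypothetical and not part of the PR.
final class ValueAssignmentSketch {

  // Plain computation instead of a Function<Integer, Integer> supplier.
  static int valueFor(int docID, int indexOrd, int numIndices) {
    return docID * numIndices + indexOrd;
  }

  // Collects the value of every document across all indices.
  static Set<Integer> allValues(int numIndices, int numDocsInIndex) {
    Set<Integer> values = new HashSet<>();
    for (int i = 0; i < numIndices; i++) {
      for (int docID = 0; docID < numDocsInIndex; docID++) {
        values.add(valueFor(docID, i, numIndices));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    // With 3 indices of 50 docs each, every value is distinct.
    System.out.println(allValues(3, 50).size()); // 150
  }
}
```

Because `i` is always smaller than `numIndices`, the values interleave across indices without collisions, so documents that share a docID in different indices still differ on the second sort field.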




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


