Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/04 13:01:35 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

mayya-sharipova opened a new pull request #204:
URL: https://github.com/apache/lucene/pull/204


   DocComparator should not skip docs with the same docID on multiple
   sorts with search after.
   
   Because of the optimization introduced in LUCENE-9449, currently when
   searching with sort on [_doc, other fields] with search after,
   DocComparator can efficiently skip all docs before and including
   the provided [search after docID]. This is a desirable behaviour
   in a single index search. But in a distributed search, where multiple
   indices have docs with the same docID, and when searching on
   [_doc, other fields], the sort optimization should NOT skip
   documents with the same docIDs.
   
   This PR fixes this.
   
   Relates to LUCENE-9449
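
To make the failure mode concrete, here is a minimal sketch of the scenario described above. It is a toy model, not Lucene's actual API: the class `SearchAfterSketch`, the helpers `minDoc` and `collect`, and the sample values are all hypothetical. It assumes a second index whose docs share docIDs with the index that produced the search-after hit, and a sort on [_doc, value].

```java
import java.util.ArrayList;
import java.util.List;

public class SearchAfterSketch {

  // Hypothetical helper mirroring the idea of the fix: the first docID a leaf still visits.
  // singleSort == true corresponds to a sort on _doc only.
  static int minDoc(boolean singleSort, int topValue) {
    return singleSort ? topValue + 1 : topValue;
  }

  // Toy collector for one index sorted on [_doc, value]: returns the docIDs that sort
  // strictly after the (afterDoc, afterValue) pair, visiting only docs >= minDoc.
  static List<Integer> collect(int[] values, int minDoc, int afterDoc, int afterValue) {
    List<Integer> hits = new ArrayList<>();
    for (int doc = Math.max(minDoc, 0); doc < values.length; doc++) {
      boolean sortsAfter = doc > afterDoc || (doc == afterDoc && values[doc] > afterValue);
      if (sortsAfter) {
        hits.add(doc);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    // Values of docs 0..3 in a second index (assumed data). The previous page ended on
    // docID 1 with value 2, a hit that came from a different index.
    int[] indexB = {1, 3, 5, 7};
    int afterDoc = 1;
    int afterValue = 2;

    // Skipping everything up to and including afterDoc drops doc 1 of this index,
    // even though (1, 3) sorts after (1, 2).
    System.out.println(collect(indexB, minDoc(true, afterDoc), afterDoc, afterValue));

    // Starting at afterDoc keeps it.
    System.out.println(collect(indexB, minDoc(false, afterDoc), afterDoc, afterValue));
  }
}
```

With these assumed values the first call prints `[2, 3]` and the second prints `[1, 2, 3]`: the only difference is the doc that shares its docID with the search-after hit.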


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #204:
URL: https://github.com/apache/lucene/pull/204#discussion_r664107815



##########
File path: lucene/core/src/test/org/apache/lucene/search/TestSortOptimization.java
##########
@@ -332,6 +333,71 @@ public void testFloatSortOptimization() throws IOException {
     dir.close();
   }
 
+  /**
+   * Test that a search with sort on [_doc, other fields] across multiple indices doesn't miss any
+   * documents.
+   */
+  public void testDocSortOptimizationMultipleIndices() throws IOException {
+    final int numIndices = 3;
+    final int numDocsInIndex = atLeast(50);
+    Directory[] dirs = new Directory[numIndices];
+    IndexReader[] readers = new IndexReader[numIndices];
+    for (int i = 0; i < numIndices; i++) {
+      dirs[i] = newDirectory();
+      final int remainder = i % 3;

Review comment:
       Since `numIndices` is 3, do we need to take a mod here?

##########
File path: lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java
##########
@@ -81,7 +87,12 @@ public Integer value(int slot) {
     public DocLeafComparator(LeafReaderContext context) {
       this.docBase = context.docBase;
       if (enableSkipping) {
-        this.minDoc = topValue + 1;
+        // For a single sort on _doc, we want to skip all docs before topValue.
+        // For multiple fields sort on [_doc, other fields], we want to include docs with the same
+        // docID.
+        // This is needed in a distributed search, where there are docs from different indices with
+        // the same docID.
+        this.minDoc = singleSort ? topValue + 1 : topValue;

Review comment:
       This seems to work and matches the approach in `NumericComparator`. I guess it doesn't specifically address the case where `_doc` is the last sort, for example a sort on `["some_field", "_doc"]`, where we could also use `topValue + 1`. 
   
   One thing I wondered: is keeping track of `singleSort` really important, or could we simplify and just always use `topValue`? At most we'd consider one extra document. A similar simplification would apply to `NumericComparator`. The skipping logic is a bit complex and I'm thinking about the performance/simplicity trade-off.

##########
File path: lucene/core/src/test/org/apache/lucene/search/TestSortOptimization.java
##########
@@ -332,6 +333,71 @@ public void testFloatSortOptimization() throws IOException {
     dir.close();
   }
 
+  /**
+   * Test that a search with sort on [_doc, other fields] across multiple indices doesn't miss any
+   * documents.
+   */
+  public void testDocSortOptimizationMultipleIndices() throws IOException {
+    final int numIndices = 3;
+    final int numDocsInIndex = atLeast(50);
+    Directory[] dirs = new Directory[numIndices];
+    IndexReader[] readers = new IndexReader[numIndices];
+    for (int i = 0; i < numIndices; i++) {
+      dirs[i] = newDirectory();
+      final int remainder = i % 3;
+      Function<Integer, Integer> valueSupplier = docID -> (docID * 3 + remainder);

Review comment:
       Maybe this could be a simple variable assignment instead of using a supplier. Also I think we can replace 3 with `numIndices`.
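
One way the two suggestions on this hunk could look is sketched below. This is an illustration only, not the code from the follow-up commit: the class `ValueAssignmentSketch` and the helper `valuesPerIndex` are hypothetical. Since `i` is always smaller than `numIndices`, `i % numIndices` equals `i`, so the remainder variable and the supplier can collapse into a plain expression.

```java
import java.util.Arrays;

public class ValueAssignmentSketch {

  // Hypothetical helper: the value assigned to each doc of each index, computed as
  // docID * numIndices + i so that no value repeats across indices.
  static int[][] valuesPerIndex(int numIndices, int numDocsInIndex) {
    int[][] values = new int[numIndices][numDocsInIndex];
    for (int i = 0; i < numIndices; i++) {
      for (int docID = 0; docID < numDocsInIndex; docID++) {
        // Plain assignment instead of a Function<Integer, Integer> supplier.
        values[i][docID] = docID * numIndices + i;
      }
    }
    return values;
  }

  public static void main(String[] args) {
    for (int[] perIndex : valuesPerIndex(3, 4)) {
      System.out.println(Arrays.toString(perIndex));
    }
  }
}
```

For 3 indices with 4 docs each this prints `[0, 3, 6, 9]`, `[1, 4, 7, 10]` and `[2, 5, 8, 11]`, so docs with the same docID in different indices get distinct values.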




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #204:
URL: https://github.com/apache/lucene/pull/204#discussion_r664804322



##########
File path: lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java
##########
@@ -81,7 +87,12 @@ public Integer value(int slot) {
     public DocLeafComparator(LeafReaderContext context) {
       this.docBase = context.docBase;
       if (enableSkipping) {
-        this.minDoc = topValue + 1;
+        // For a single sort on _doc, we want to skip all docs before topValue.
+        // For multiple fields sort on [_doc, other fields], we want to include docs with the same
+        // docID.
+        // This is needed in a distributed search, where there are docs from different indices with
+        // the same docID.
+        this.minDoc = singleSort ? topValue + 1 : topValue;

Review comment:
       Right, the `singleSort` check is used not only in the search-after case.






[GitHub] [lucene] jtibshirani commented on a change in pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #204:
URL: https://github.com/apache/lucene/pull/204#discussion_r664698038



##########
File path: lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java
##########
@@ -81,7 +87,12 @@ public Integer value(int slot) {
     public DocLeafComparator(LeafReaderContext context) {
       this.docBase = context.docBase;
       if (enableSkipping) {
-        this.minDoc = topValue + 1;
+        // For a single sort on _doc, we want to skip all docs before topValue.
+        // For multiple fields sort on [_doc, other fields], we want to include docs with the same
+        // docID.
+        // This is needed in a distributed search, where there are docs from different indices with
+        // the same docID.
+        this.minDoc = singleSort ? topValue + 1 : topValue;

Review comment:
       > For `NumericComparator` though this is not the case, and there could be a huge number of docs with the same value, so the extra optimization for `singleSort` is important.
   
   Oh I see, `NumericComparator` isn't just used in 'search after' cases so there could be many docs that share the same value.






[GitHub] [lucene] mayya-sharipova merged pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

Posted by GitBox <gi...@apache.org>.
mayya-sharipova merged pull request #204:
URL: https://github.com/apache/lucene/pull/204


   




[GitHub] [lucene] mayya-sharipova commented on a change in pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on a change in pull request #204:
URL: https://github.com/apache/lucene/pull/204#discussion_r664573881



##########
File path: lucene/core/src/java/org/apache/lucene/search/comparators/DocComparator.java
##########
@@ -81,7 +87,12 @@ public Integer value(int slot) {
     public DocLeafComparator(LeafReaderContext context) {
       this.docBase = context.docBase;
       if (enableSkipping) {
-        this.minDoc = topValue + 1;
+        // For a single sort on _doc, we want to skip all docs before topValue.
+        // For multiple fields sort on [_doc, other fields], we want to include docs with the same
+        // docID.
+        // This is needed in a distributed search, where there are docs from different indices with
+        // the same docID.
+        this.minDoc = singleSort ? topValue + 1 : topValue;

Review comment:
       Great comment! +1 for simplifying the code at the expense of considering one extra doc in `DocComparator`.  
   
   For `NumericComparator` though this is not the case, and there could be a huge number of docs with the same value, so the extra optimization for `singleSort` is important.
   
   > I guess it doesn't specifically address the case where _doc is the last sort, for example a sort on ["some_field", "_doc"], where we could also use topValue + 1.
   
   No, the sort optimizations in `DocComparator` are applicable only where `_doc` is the 1st sort.






[GitHub] [lucene] mayya-sharipova commented on pull request #204: LUCENE-10020 DocComparator don't skip docs of same docID

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on pull request #204:
URL: https://github.com/apache/lucene/pull/204#issuecomment-874777903


   @jtibshirani Thanks for the review. I've tried to address all your feedback in b5af441a39611d63df30168452cdec521ce4d578

