You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/21 22:56:39 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #11803: DrillSideways optimizations

gsmiller opened a new pull request, #11803:
URL: https://github.com/apache/lucene/pull/11803

   ### Description
   
   This change makes use of `advance` instead of `next` where possible and splits out 1st and 2nd phase checking to avoid match confirmation when unnecessary.
   
   Note that I only focused on the `doQueryFirstScoring` implementation here and didn't modify the other two scoring approaches. "Progress not perfection" and all that (plus, I think we should strongly consider removing these other two implementations, but we'd want to benchmark to be certain).
   
   Unfortunately, `luceneutil` doesn't have dedicated drill sideways benchmarks, but some benchmarks on our internal software that makes use of drill sideways showed a +2% QPS improvement and no obvious regressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller merged pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller merged PR #11803:
URL: https://github.com/apache/lucene/pull/11803


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #11803:
URL: https://github.com/apache/lucene/pull/11803#discussion_r980560040


##########
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java:
##########
@@ -149,63 +162,147 @@ public int score(LeafCollector collector, Bits acceptDocs, int min, int maxDoc)
    */
   private void doQueryFirstScoring(Bits acceptDocs, LeafCollector collector, DocsAndCost[] dims)
       throws IOException {
-    // if (DEBUG) {
-    //  System.out.println("  doQueryFirstScoring");
-    // }
     setScorer(collector, ScoreCachingWrappingScorer.wrap(baseScorer));
 
-    int docID = baseScorer.docID();
+    List<DocsAndCost> allDims = Arrays.asList(dims);
+    CollectionUtil.timSort(allDims, Comparator.comparingLong(e -> e.approximation.cost()));
+
+    List<DocsAndCost> twoPhaseDims = null;
+    for (DocsAndCost dim : dims) {
+      if (dim.twoPhase != null) {
+        if (twoPhaseDims == null) {
+          twoPhaseDims = new ArrayList<>(dims.length);
+        }
+        twoPhaseDims.add(dim);
+      }
+    }
+    if (twoPhaseDims != null) {
+      CollectionUtil.timSort(
+          twoPhaseDims,
+          new Comparator<DocsAndCost>() {
+            @Override
+            public int compare(DocsAndCost o1, DocsAndCost o2) {
+              return Float.compare(o1.twoPhase.matchCost(), o2.twoPhase.matchCost());
+            }
+          });
+    }
+
+    int docID = baseApproximation.docID();
 
     nextDoc:
     while (docID != PostingsEnum.NO_MORE_DOCS) {
+      assert docID == baseApproximation.docID();
+
       if (acceptDocs != null && acceptDocs.get(docID) == false) {
-        docID = baseIterator.nextDoc();
+        docID = baseApproximation.nextDoc();
         continue;
       }
-      LeafCollector failedCollector = null;
-      for (DocsAndCost dim : dims) {
-        // TODO: should we sort this 2nd dimension of
-        // docsEnums from most frequent to least?
+
+      DocsAndCost failedDim = null;
+      for (DocsAndCost dim : allDims) {
+        final int dimDocID;
         if (dim.approximation.docID() < docID) {
-          dim.approximation.advance(docID);
+          dimDocID = dim.approximation.advance(docID);
+        } else {
+          dimDocID = dim.approximation.docID();
         }
-
-        boolean matches = false;
-        if (dim.approximation.docID() == docID) {
-          if (dim.twoPhase == null) {
-            matches = true;
+        if (dimDocID != docID) {
+          if (failedDim != null) {
+            int next = Math.min(dimDocID, failedDim.approximation.docID());
+            docID = baseApproximation.advance(next);
+            continue nextDoc;
           } else {
-            matches = dim.twoPhase.matches();
+            failedDim = dim;
           }
         }
+      }
 
-        if (matches == false) {
-          if (failedCollector != null) {
-            // More than one dim fails on this document, so
-            // it's neither a hit nor a near-miss; move to
-            // next doc:
-            docID = baseIterator.nextDoc();
-            continue nextDoc;
-          } else {
-            failedCollector = dim.sidewaysLeafCollector;
+      if (baseTwoPhase != null && baseTwoPhase.matches() == false) {
+        docID = baseApproximation.nextDoc();
+        continue;
+      }
+
+      if (twoPhaseDims != null) {
+        if (failedDim == null) {
+          for (DocsAndCost dim : twoPhaseDims) {
+            assert dim.approximation.docID() == docID;

Review Comment:
   Sure, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261198752

   @zhaih I reexamined our test coverage and think we're in good shape already actually. We've got good coverage for covering drill-sideways correctness with multiple dimensions, etc. (including random and non-random). We could try to take these further by somehow asserting that advance is being used in favor of nextDoc when appropriate, but I think those tests would be reasonably complex to write and I'm not sure they add tremendous value. I'd rather we spent time building drill-sideways benchmarks that focus on ensuring our performance doesn't regress. But that's just my opinion. Please let me know if you feel differently and we can keep discussing. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #11803:
URL: https://github.com/apache/lucene/pull/11803#discussion_r980557793


##########
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java:
##########
@@ -149,63 +162,147 @@ public int score(LeafCollector collector, Bits acceptDocs, int min, int maxDoc)
    */
   private void doQueryFirstScoring(Bits acceptDocs, LeafCollector collector, DocsAndCost[] dims)
       throws IOException {
-    // if (DEBUG) {
-    //  System.out.println("  doQueryFirstScoring");
-    // }
     setScorer(collector, ScoreCachingWrappingScorer.wrap(baseScorer));
 
-    int docID = baseScorer.docID();
+    List<DocsAndCost> allDims = Arrays.asList(dims);
+    CollectionUtil.timSort(allDims, Comparator.comparingLong(e -> e.approximation.cost()));
+
+    List<DocsAndCost> twoPhaseDims = null;
+    for (DocsAndCost dim : dims) {
+      if (dim.twoPhase != null) {
+        if (twoPhaseDims == null) {
+          twoPhaseDims = new ArrayList<>(dims.length);
+        }
+        twoPhaseDims.add(dim);
+      }
+    }
+    if (twoPhaseDims != null) {
+      CollectionUtil.timSort(
+          twoPhaseDims,
+          new Comparator<DocsAndCost>() {
+            @Override
+            public int compare(DocsAndCost o1, DocsAndCost o2) {
+              return Float.compare(o1.twoPhase.matchCost(), o2.twoPhase.matchCost());
+            }
+          });
+    }
+
+    int docID = baseApproximation.docID();
 
     nextDoc:
     while (docID != PostingsEnum.NO_MORE_DOCS) {
+      assert docID == baseApproximation.docID();
+
       if (acceptDocs != null && acceptDocs.get(docID) == false) {
-        docID = baseIterator.nextDoc();
+        docID = baseApproximation.nextDoc();
         continue;
       }
-      LeafCollector failedCollector = null;
-      for (DocsAndCost dim : dims) {
-        // TODO: should we sort this 2nd dimension of
-        // docsEnums from most frequent to least?
+
+      DocsAndCost failedDim = null;
+      for (DocsAndCost dim : allDims) {
+        final int dimDocID;
         if (dim.approximation.docID() < docID) {
-          dim.approximation.advance(docID);
+          dimDocID = dim.approximation.advance(docID);
+        } else {
+          dimDocID = dim.approximation.docID();
         }
-
-        boolean matches = false;
-        if (dim.approximation.docID() == docID) {
-          if (dim.twoPhase == null) {
-            matches = true;
+        if (dimDocID != docID) {
+          if (failedDim != null) {
+            int next = Math.min(dimDocID, failedDim.approximation.docID());
+            docID = baseApproximation.advance(next);
+            continue nextDoc;
           } else {
-            matches = dim.twoPhase.matches();
+            failedDim = dim;
           }
         }
+      }
 
-        if (matches == false) {
-          if (failedCollector != null) {
-            // More than one dim fails on this document, so
-            // it's neither a hit nor a near-miss; move to
-            // next doc:
-            docID = baseIterator.nextDoc();
-            continue nextDoc;
-          } else {
-            failedCollector = dim.sidewaysLeafCollector;
+      if (baseTwoPhase != null && baseTwoPhase.matches() == false) {
+        docID = baseApproximation.nextDoc();
+        continue;
+      }
+
+      if (twoPhaseDims != null) {
+        if (failedDim == null) {
+          for (DocsAndCost dim : twoPhaseDims) {
+            assert dim.approximation.docID() == docID;
+            if (dim.twoPhase.matches() == false) {
+              if (failedDim != null) {
+                assert dim.approximation.docID() + 1 <= failedDim.approximation.docID();

Review Comment:
   Great catch, thanks! I think this must be a silly copy/paste on my part. I can't image what I was thinking otherwise. Also, we can be more aggressive with setting the next docID here. I'll update the PR with these changes. I've actually been working on another change that builds on this (on a separate fork) which allows sideways facets collectors to terminate early and then fold their conditions into hard requirements on the base query. That's a whole separate change to propose another day, but what slipped through here is an unfortunate side-effect of trying to keep two versions of a partial change in sync. Anyway, thanks again for the catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on a diff in pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
zhaih commented on code in PR #11803:
URL: https://github.com/apache/lucene/pull/11803#discussion_r980469938


##########
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java:
##########
@@ -149,63 +162,147 @@ public int score(LeafCollector collector, Bits acceptDocs, int min, int maxDoc)
    */
   private void doQueryFirstScoring(Bits acceptDocs, LeafCollector collector, DocsAndCost[] dims)
       throws IOException {
-    // if (DEBUG) {
-    //  System.out.println("  doQueryFirstScoring");
-    // }
     setScorer(collector, ScoreCachingWrappingScorer.wrap(baseScorer));
 
-    int docID = baseScorer.docID();
+    List<DocsAndCost> allDims = Arrays.asList(dims);
+    CollectionUtil.timSort(allDims, Comparator.comparingLong(e -> e.approximation.cost()));
+
+    List<DocsAndCost> twoPhaseDims = null;
+    for (DocsAndCost dim : dims) {
+      if (dim.twoPhase != null) {
+        if (twoPhaseDims == null) {
+          twoPhaseDims = new ArrayList<>(dims.length);
+        }
+        twoPhaseDims.add(dim);
+      }
+    }
+    if (twoPhaseDims != null) {
+      CollectionUtil.timSort(
+          twoPhaseDims,
+          new Comparator<DocsAndCost>() {
+            @Override
+            public int compare(DocsAndCost o1, DocsAndCost o2) {
+              return Float.compare(o1.twoPhase.matchCost(), o2.twoPhase.matchCost());
+            }
+          });
+    }
+
+    int docID = baseApproximation.docID();
 
     nextDoc:
     while (docID != PostingsEnum.NO_MORE_DOCS) {
+      assert docID == baseApproximation.docID();
+
       if (acceptDocs != null && acceptDocs.get(docID) == false) {
-        docID = baseIterator.nextDoc();
+        docID = baseApproximation.nextDoc();
         continue;
       }
-      LeafCollector failedCollector = null;
-      for (DocsAndCost dim : dims) {
-        // TODO: should we sort this 2nd dimension of
-        // docsEnums from most frequent to least?
+
+      DocsAndCost failedDim = null;
+      for (DocsAndCost dim : allDims) {
+        final int dimDocID;
         if (dim.approximation.docID() < docID) {
-          dim.approximation.advance(docID);
+          dimDocID = dim.approximation.advance(docID);
+        } else {
+          dimDocID = dim.approximation.docID();
         }
-
-        boolean matches = false;
-        if (dim.approximation.docID() == docID) {
-          if (dim.twoPhase == null) {
-            matches = true;
+        if (dimDocID != docID) {
+          if (failedDim != null) {
+            int next = Math.min(dimDocID, failedDim.approximation.docID());
+            docID = baseApproximation.advance(next);
+            continue nextDoc;
           } else {
-            matches = dim.twoPhase.matches();
+            failedDim = dim;
           }
         }
+      }
 
-        if (matches == false) {
-          if (failedCollector != null) {
-            // More than one dim fails on this document, so
-            // it's neither a hit nor a near-miss; move to
-            // next doc:
-            docID = baseIterator.nextDoc();
-            continue nextDoc;
-          } else {
-            failedCollector = dim.sidewaysLeafCollector;
+      if (baseTwoPhase != null && baseTwoPhase.matches() == false) {
+        docID = baseApproximation.nextDoc();
+        continue;
+      }
+
+      if (twoPhaseDims != null) {
+        if (failedDim == null) {
+          for (DocsAndCost dim : twoPhaseDims) {
+            assert dim.approximation.docID() == docID;
+            if (dim.twoPhase.matches() == false) {
+              if (failedDim != null) {
+                assert dim.approximation.docID() + 1 <= failedDim.approximation.docID();

Review Comment:
   Why is this assertion true? Shouldn't the two docID be the same and both equal to `docID` according to assertion in L228?



##########
lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java:
##########
@@ -149,63 +162,147 @@ public int score(LeafCollector collector, Bits acceptDocs, int min, int maxDoc)
    */
   private void doQueryFirstScoring(Bits acceptDocs, LeafCollector collector, DocsAndCost[] dims)
       throws IOException {
-    // if (DEBUG) {
-    //  System.out.println("  doQueryFirstScoring");
-    // }
     setScorer(collector, ScoreCachingWrappingScorer.wrap(baseScorer));
 
-    int docID = baseScorer.docID();
+    List<DocsAndCost> allDims = Arrays.asList(dims);
+    CollectionUtil.timSort(allDims, Comparator.comparingLong(e -> e.approximation.cost()));
+
+    List<DocsAndCost> twoPhaseDims = null;
+    for (DocsAndCost dim : dims) {
+      if (dim.twoPhase != null) {
+        if (twoPhaseDims == null) {
+          twoPhaseDims = new ArrayList<>(dims.length);
+        }
+        twoPhaseDims.add(dim);
+      }
+    }
+    if (twoPhaseDims != null) {
+      CollectionUtil.timSort(
+          twoPhaseDims,
+          new Comparator<DocsAndCost>() {
+            @Override
+            public int compare(DocsAndCost o1, DocsAndCost o2) {
+              return Float.compare(o1.twoPhase.matchCost(), o2.twoPhase.matchCost());
+            }
+          });
+    }
+
+    int docID = baseApproximation.docID();
 
     nextDoc:
     while (docID != PostingsEnum.NO_MORE_DOCS) {
+      assert docID == baseApproximation.docID();
+
       if (acceptDocs != null && acceptDocs.get(docID) == false) {
-        docID = baseIterator.nextDoc();
+        docID = baseApproximation.nextDoc();
         continue;
       }
-      LeafCollector failedCollector = null;
-      for (DocsAndCost dim : dims) {
-        // TODO: should we sort this 2nd dimension of
-        // docsEnums from most frequent to least?
+
+      DocsAndCost failedDim = null;
+      for (DocsAndCost dim : allDims) {
+        final int dimDocID;
         if (dim.approximation.docID() < docID) {
-          dim.approximation.advance(docID);
+          dimDocID = dim.approximation.advance(docID);
+        } else {
+          dimDocID = dim.approximation.docID();
         }
-
-        boolean matches = false;
-        if (dim.approximation.docID() == docID) {
-          if (dim.twoPhase == null) {
-            matches = true;
+        if (dimDocID != docID) {
+          if (failedDim != null) {
+            int next = Math.min(dimDocID, failedDim.approximation.docID());
+            docID = baseApproximation.advance(next);
+            continue nextDoc;
           } else {
-            matches = dim.twoPhase.matches();
+            failedDim = dim;
           }
         }
+      }
 
-        if (matches == false) {
-          if (failedCollector != null) {
-            // More than one dim fails on this document, so
-            // it's neither a hit nor a near-miss; move to
-            // next doc:
-            docID = baseIterator.nextDoc();
-            continue nextDoc;
-          } else {
-            failedCollector = dim.sidewaysLeafCollector;
+      if (baseTwoPhase != null && baseTwoPhase.matches() == false) {
+        docID = baseApproximation.nextDoc();
+        continue;
+      }
+
+      if (twoPhaseDims != null) {
+        if (failedDim == null) {
+          for (DocsAndCost dim : twoPhaseDims) {
+            assert dim.approximation.docID() == docID;

Review Comment:
   I found it easier to understand if we assert `dim.approximation.docID() == baseApproximation.docID()` or simply add another line asserting `docID == baseApproximation.docID()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
zhaih commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1258979907

   Changes LGTM, do we need to add some unit tests?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1259557102

   > Changes LGTM, do we need to add some unit tests?
   
   Thanks @zhaih. Let me consider some specific test for this. I know our randomized testing for DrillSideways covers these code paths but maybe some specific, non-random tests would be useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261579341

   @zhaih well, thank you for keeping me honest with testing. I think I've already found an insidious, potential bug with some beefier tests. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261458849

   @zhaih that's a good point and valid concern. I dug into the existing tests and it looks like we have lots of coverage _except_ that the majority of the coverage is using basic, single-phase drill-down dimensions. I'm going to augment our randomized testing to randomly use two-phase drill-downs to broaden coverage. Thanks for the discussion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] zhaih commented on pull request #11803: DrillSideways optimizations

Posted by GitBox <gi...@apache.org>.
zhaih commented on PR #11803:
URL: https://github.com/apache/lucene/pull/11803#issuecomment-1261541919

   @gsmiller Thank you for checking and continuous effort!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org