Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/04/21 07:21:08 UTC

[GitHub] [lucene] zacharymorn opened a new pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

zacharymorn opened a new pull request #101:
URL: https://github.com/apache/lucene/pull/101


   Implement the BMM (Block-Max Maxscore) algorithm from "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy, and Suel.
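
   A rough sketch of the core idea behind BMM / MAXSCORE partitioning (illustrative only; the names below are hypothetical stand-ins, not the actual fields of the new scorer): keep the clauses sorted by their score upper bound, treat the longest prefix whose summed upper bounds stays below the minimum competitive score as "non-essential" (those clauses alone can never make a document competitive), and let only the remaining "essential" clauses drive iteration.

   ```
   // Sketch only, not the PR's code. SubScorer is a stand-in holding a clause's
   // score upper bound (maxScore); assumes java.util.List.
   static void partition(List<SubScorer> sortedByMaxScore, float minCompetitiveScore,
       List<SubScorer> nonEssential, List<SubScorer> essential) {
     float nonEssentialMaxScoreSum = 0;
     for (SubScorer s : sortedByMaxScore) { // ascending by maxScore
       if (nonEssentialMaxScoreSum + s.maxScore < minCompetitiveScore) {
         // Together with the earlier clauses, this one still cannot reach the
         // minimum competitive score, so it never needs to lead iteration.
         nonEssentialMaxScoreSum += s.maxScore;
         nonEssential.add(s);
       } else {
         essential.add(s);
       }
     }
   }
   ```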


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841124271


   > In the JIRA ticket you had suggested using BMM for top-level (flat?) boolean queries only. Do you think this will need to be fixed?
   
   I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.
   
   > The one result that does show negative impact to AndMedOrHighHigh also shows an impact to OrHighMed, so it's a bit strange and may need further investigation to find the cause.
   
   Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.




[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836880288


   Wonderful, thanks @mikemccand !




[GitHub] [lucene] jpountz commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r625979068



##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+      throws IOException {
+    super(weight);
+    assert scoreMode == ScoreMode.TOP_SCORES;
+
+    this.scoreMode = scoreMode;
+    this.doc = -1;
+    this.scorers = scorers;
+    this.cost =
+        costWithMinShouldMatch(
+            scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+            scorers.size(),
+            1);
+
+    essentialsScorers = new DisiPriorityQueue(scorers.size());
+    nonEssentialScorers = new LinkedList<>();
+
+    scalingFactor = WANDScorer.getScalingFactor(scorers);
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+    for (Scorer scorer : scorers) {
+      nonEssentialScorers.add(new DisiWrapper(scorer));
+    }
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+          private long lastMinCompetitiveScore;
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            doAdvance(target);
+
+            while (doc != DocIdSetIterator.NO_MORE_DOCS
+                && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+              doAdvance(doc + 1);
+            }
+
+            return doc;
+          }
+
+          private void doAdvance(int target) throws IOException {
+            matchedDocScoreSum = 0;
+            // Find next smallest doc id that is larger than or equal to target from the essential
+            // scorers
+
+            // If the next candidate doc id is still within interval boundary,
+            if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+              while (essentialsScorers.top().doc < target) {
+                DisiWrapper w = essentialsScorers.pop();
+                w.doc = w.iterator.advance(target);
+                essentialsScorers.add(w);

Review comment:
       can you use updateTop instead? It's usually faster than pop+add
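
       For example, something along these lines (just a sketch of the suggested shape, reusing the variables from the diff above):

       ```
       // updateTop() re-orders the mutated top entry in place, avoiding the
       // extra heap operations of a separate pop() followed by add().
       while (essentialsScorers.top().doc < target) {
         DisiWrapper top = essentialsScorers.top();
         top.doc = top.iterator.advance(target);
         essentialsScorers.updateTop();
       }
       ```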






[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-837880326


   > Thanks @mikemccand and @jpountz for the uploads!
   > 
   > > The nightly benchmarks use the binary form of wikibigall, to reduce the thread bottleneck when reading/parsing documents to index. Hmm, it is sampled from a different date (01/15/2011) ... OK, I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
   > 
   > I'm assuming we should now run `nightlyBench.py` instead of `wikibigall`, since the latter depends on `enwiki-20100302-pages-articles-lines.txt` but it's no longer available?
   
   Actually, I'm still not quite sure how to run it with `enwiki-20110115-lines.bin`, since I don't think it's referenced in the luceneutil code? I'll dig around in the nightly benchmark job tomorrow to find out.




[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-837909869


   Actually you don't need nightlyBench.py; you can use the standard Python script. I think the following should work for testing on the larger documents:
    - Download https://home.apache.org/~mikemccand/enwiki-20130102-lines.txt.lzma and put it under the same directory as the other data files like `enwiki-20120502-lines-1k.txt`.
    - Decompress it with `unlzma enwiki-20130102-lines.txt.lzma`.
    - Add the following to your `localconstants.py`:
   
   ```
   WIKI_BIG_DOCS_LINE_FILE = '%s/data/enwiki-20130102-lines.txt' % BASE_DIR
   WIKI_BIG_DOCS_COUNT = 6647577
   WIKI_BIG_TASKS_FILE = '%s/tasks/wikinightly.tasks' % BENCH_BASE_DIR
   ```
    - Run the benchmark with `-source wikibig10k` to make sure everything works.
    - Finally run the benchmark with `-source wikibigall`.
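
    (For reference, those last two steps would be an invocation along the lines of `python src/python/localrun.py -source wikibig10k` and then `-source wikibigall`; the exact script name and path depend on how your local luceneutil copy is set up.)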
   




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-829928067


   Results from 2 luceneutil runs with wikimedium5m:
   
   ```
                  TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  OrHighMed      196.58      (3.5%)      134.01      (2.5%)  -31.8% ( -36% -  -26%) 0.000
                 OrHighHigh       43.32      (2.0%)       31.45      (2.0%)  -27.4% ( -30% -  -23%) 0.000
                     Fuzzy1       94.71     (14.2%)       89.31     (16.9%)   -5.7% ( -32% -   29%) 0.249
                  OrHighLow      549.71      (4.7%)      521.55      (5.8%)   -5.1% ( -14% -    5%) 0.002
      HighTermDayOfYearSort      284.66     (11.4%)      274.13     (11.4%)   -3.7% ( -23% -   21%) 0.304
                   HighTerm     1392.83      (4.6%)     1352.76      (4.7%)   -2.9% ( -11% -    6%) 0.050
                 HighPhrase      199.03      (2.8%)      193.93      (3.0%)   -2.6% (  -8% -    3%) 0.006
                 TermDTSort      200.40     (10.9%)      196.04     (13.2%)   -2.2% ( -23% -   24%) 0.570
               OrNotHighLow     1137.55      (3.8%)     1114.38      (4.8%)   -2.0% ( -10% -    6%) 0.138
               OrNotHighMed      874.47      (5.8%)      857.02      (5.3%)   -2.0% ( -12% -    9%) 0.259
          HighTermMonthSort      285.86     (16.9%)      280.37     (15.4%)   -1.9% ( -29% -   36%) 0.707
               OrHighNotLow      797.17      (5.2%)      782.04      (5.5%)   -1.9% ( -11% -    9%) 0.263
                 AndHighMed      259.03      (3.2%)      254.45      (4.8%)   -1.8% (  -9% -    6%) 0.169
                    LowTerm     1619.44      (3.4%)     1592.71      (4.4%)   -1.7% (  -9% -    6%) 0.184
                    MedTerm     1448.09      (3.7%)     1432.36      (4.6%)   -1.1% (  -9% -    7%) 0.413
                  LowPhrase      530.26      (3.0%)      524.77      (3.8%)   -1.0% (  -7% -    6%) 0.343
                   Wildcard      237.52      (1.7%)      235.18      (2.1%)   -1.0% (  -4% -    2%) 0.104
                   PKLookup      226.73      (2.8%)      224.67      (2.5%)   -0.9% (  -6% -    4%) 0.277
                AndHighHigh      167.49      (3.4%)      166.40      (2.7%)   -0.6% (  -6% -    5%) 0.505
            MedSloppyPhrase       52.92      (1.9%)       52.60      (2.2%)   -0.6% (  -4% -    3%) 0.339
                MedSpanNear      128.06      (2.0%)      127.34      (2.3%)   -0.6% (  -4% -    3%) 0.405
                 AndHighLow     1016.64      (3.9%)     1011.41      (4.2%)   -0.5% (  -8% -    7%) 0.689
                    Prefix3      333.77      (3.9%)      332.40      (4.0%)   -0.4% (  -8% -    7%) 0.742
                LowSpanNear       45.55      (2.6%)       45.37      (2.6%)   -0.4% (  -5% -    4%) 0.636
       HighTermTitleBDVSort      152.08     (15.6%)      151.52     (14.8%)   -0.4% ( -26% -   35%) 0.938
       HighIntervalsOrdered       23.28      (1.9%)       23.20      (2.4%)   -0.4% (  -4% -    3%) 0.597
           HighSloppyPhrase      276.19      (2.4%)      275.46      (2.8%)   -0.3% (  -5% -    5%) 0.749
      BrowseMonthTaxoFacets       13.43      (2.7%)       13.40      (3.0%)   -0.2% (  -5% -    5%) 0.827
              OrNotHighHigh      730.45      (6.8%)      729.04      (5.1%)   -0.2% ( -11% -   12%) 0.919
                     IntNRQ      137.68     (16.3%)      137.51     (16.6%)   -0.1% ( -28% -   39%) 0.981
              OrHighNotHigh      629.90      (4.9%)      630.02      (5.0%)    0.0% (  -9% -   10%) 0.990
            LowSloppyPhrase       20.78      (1.2%)       20.79      (1.7%)    0.1% (  -2% -    3%) 0.895
                    Respell      117.77      (2.5%)      117.85      (2.4%)    0.1% (  -4% -    5%) 0.930
   BrowseDayOfYearSSDVFacets       28.73      (2.6%)       28.75      (2.6%)    0.1% (  -4% -    5%) 0.930
                  MedPhrase       51.36      (3.1%)       51.41      (2.1%)    0.1% (  -4% -    5%) 0.902
               HighSpanNear        3.58      (2.2%)        3.58      (2.8%)    0.2% (  -4% -    5%) 0.835
                     Fuzzy2       20.57      (4.2%)       20.66      (3.5%)    0.5% (  -6% -    8%) 0.710
               OrHighNotMed      753.94      (6.4%)      758.23      (5.1%)    0.6% ( -10% -   12%) 0.756
       BrowseDateTaxoFacets       11.29      (3.1%)       11.35      (3.0%)    0.6% (  -5% -    6%) 0.547
      BrowseMonthSSDVFacets       31.97      (6.9%)       32.21      (5.4%)    0.8% ( -10% -   13%) 0.698
   BrowseDayOfYearTaxoFacets       11.30      (3.0%)       11.38      (3.1%)    0.8% (  -5% -    7%) 0.427
   ```
   ```
                  TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                 OrHighHigh       77.68      (3.2%)       36.44      (3.3%)  -53.1% ( -57% -  -48%) 0.000
                  OrHighMed      265.36      (4.9%)      148.05      (2.9%)  -44.2% ( -49% -  -38%) 0.000
                  OrHighLow      563.62      (6.2%)      392.94      (3.6%)  -30.3% ( -37% -  -21%) 0.000
              OrHighNotHigh      647.72      (6.3%)      625.42      (7.9%)   -3.4% ( -16% -   11%) 0.126
                     IntNRQ      422.98      (3.5%)      410.61      (6.9%)   -2.9% ( -12% -    7%) 0.091
               OrHighNotLow      840.72      (7.0%)      818.98      (4.4%)   -2.6% ( -13% -    9%) 0.162
                   HighTerm     1464.33      (4.7%)     1428.42      (5.3%)   -2.5% ( -11% -    7%) 0.119
                  MedPhrase      486.70      (4.1%)      478.83      (3.3%)   -1.6% (  -8% -    6%) 0.171
               OrHighNotMed      738.23      (5.9%)      726.30      (7.4%)   -1.6% ( -14% -   12%) 0.444
              OrNotHighHigh      713.95      (4.5%)      703.94      (5.2%)   -1.4% ( -10% -    8%) 0.358
       BrowseDateTaxoFacets       10.78      (3.6%)       10.63      (4.1%)   -1.4% (  -8% -    6%) 0.260
                    MedTerm     1709.93      (4.8%)     1686.84      (5.8%)   -1.4% ( -11% -    9%) 0.422
                    LowTerm     1836.19      (5.4%)     1811.93      (6.6%)   -1.3% ( -12% -   11%) 0.491
   BrowseDayOfYearTaxoFacets       10.80      (3.8%)       10.66      (4.0%)   -1.3% (  -8% -    6%) 0.309
                 HighPhrase       40.45      (2.7%)       39.99      (2.5%)   -1.1% (  -6% -    4%) 0.160
      BrowseMonthTaxoFacets       13.11      (2.6%)       12.98      (3.2%)   -1.0% (  -6% -    4%) 0.254
           HighSloppyPhrase       32.04      (5.6%)       31.70      (4.9%)   -1.0% ( -10% -   10%) 0.535
                 TermDTSort      431.47     (12.5%)      427.83     (13.2%)   -0.8% ( -23% -   28%) 0.836
                   Wildcard      289.98      (7.5%)      287.96      (5.3%)   -0.7% ( -12% -   13%) 0.735
                   PKLookup      217.64      (3.8%)      216.31      (3.2%)   -0.6% (  -7% -    6%) 0.580
       HighIntervalsOrdered        4.21      (2.6%)        4.20      (3.0%)   -0.4% (  -5% -    5%) 0.638
   BrowseDayOfYearSSDVFacets       27.75      (2.6%)       27.64      (2.3%)   -0.4% (  -5% -    4%) 0.589
                 AndHighMed      227.13      (2.3%)      226.33      (3.9%)   -0.4% (  -6% -    5%) 0.729
                AndHighHigh      110.85      (2.4%)      110.49      (3.1%)   -0.3% (  -5% -    5%) 0.709
                    Prefix3      235.98      (3.2%)      235.41      (2.2%)   -0.2% (  -5% -    5%) 0.787
                LowSpanNear       45.89      (1.7%)       45.78      (1.6%)   -0.2% (  -3% -    3%) 0.663
            LowSloppyPhrase       24.25      (3.7%)       24.21      (3.5%)   -0.2% (  -7% -    7%) 0.867
               HighSpanNear       50.67      (3.8%)       50.59      (4.0%)   -0.2% (  -7% -    7%) 0.902
                  LowPhrase      576.08      (4.7%)      575.26      (5.2%)   -0.1% (  -9% -   10%) 0.928
               OrNotHighMed      692.19      (6.3%)      691.63      (5.1%)   -0.1% ( -10% -   12%) 0.965
                     Fuzzy2       30.66      (4.3%)       30.64      (2.9%)   -0.1% (  -7% -    7%) 0.949
                MedSpanNear       23.95      (2.0%)       23.95      (2.2%)   -0.0% (  -4% -    4%) 0.992
            MedSloppyPhrase       36.54      (5.7%)       36.57      (4.2%)    0.1% (  -9% -   10%) 0.962
      BrowseMonthSSDVFacets       31.13      (5.1%)       31.20      (4.8%)    0.2% (  -9% -   10%) 0.888
       HighTermTitleBDVSort       89.16      (8.9%)       89.37      (8.7%)    0.2% ( -15% -   19%) 0.934
                    Respell       83.91      (3.1%)       84.11      (2.7%)    0.2% (  -5% -    6%) 0.798
               OrNotHighLow     1073.98      (6.5%)     1076.76      (6.1%)    0.3% ( -11% -   13%) 0.897
                 AndHighLow      905.72      (4.4%)      908.07      (5.2%)    0.3% (  -8% -   10%) 0.864
                     Fuzzy1       81.24     (12.0%)       81.73      (9.6%)    0.6% ( -18% -   25%) 0.861
      HighTermDayOfYearSort      292.09     (15.6%)      294.83     (14.1%)    0.9% ( -24% -   36%) 0.842
          HighTermMonthSort       71.72     (11.6%)       74.49     (17.1%)    3.9% ( -22% -   36%) 0.403
   ```




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-835707134


   I cherry-picked your commit and pushed it to this branch / PR to further explore the changes and their effect; hope that's ok.
   
   > (2) feels natural to avoid doing useless score computations, though it might only work well when score upper bounds are very close to the actual scores. Maybe we should test on wikibig instead of wikimedium to get better confidence that this change makes things better.
   
   > Regarding (3) does it actually push more scorers into nonEssentialScorers? I thought I just reorganized the existing logic a bit. If it pushes more scorers into nonEssentialScorers it's probably a bug. :)
   
   I did some exploration to see how these two affect QPS by removing them here https://github.com/apache/lucene/pull/101/commits/2835055979fa6a739972d2abf30888e855b7683c, and restoring the optimizations in later commits (the luceneutil results from OrMedMedMedMedMed for these changes are included in the git commit messages). From a few runs it seems these two accounted for about a 15% improvement (from -7% to +7%), although for (3), after some more thought, I think the different implementations should actually be equivalent, since the scorers were already sorted by maxScore before they were partitioned. The changes to the iterator to avoid unneeded score computation should probably account for the rest of the improvement.
   
   I also tried to run `wikibigall`, which seems to require `enwiki-20100302-pages-articles-lines.txt`, but that file is not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving a 404 now. Is there a new location for this archive that I can download it from?
   
   > Maybe we can also try to port similar changes to the bulk scorer to see if it yields even greater benefits?
   
   Yes, I would love to do that! I'll work on that next and try out more queries.




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-834114823


   > I played with your branch in order to try to speed things up a bit and I'm getting a bit closer to our current WAND implementation: https://github.com/jpountz/lucene/tree/LUCENE-9335-BMMScorer. Hopefully it's not too buggy. On the latest benchmark I ran it was 15% faster than WANDScorer on your OrMedMedMedMedMed task.
   
   Oh wow, thanks Adrien for trying it out! That's quite a speedup, from -20 ~ 40% to +15%!! I also ran it with the full benchmark queries and saw a +20% ~ 90% speedup for `OrHighHigh`, at the cost of a 10 ~ 20% slowdown for `OrHighLow` and `OrHighMed`, but it seems like a good net speedup given that the latter two already have much faster QPS compared to `OrHighHigh`?
   
   I took a quick look at your implementation and will study it more in the next few days, but right now I can see a few optimizations you applied and suggested earlier that could explain the speedup:
   1. The use of faster priority queue methods (`updateTop`, `topList`, etc.)
   2. The use of maxScore instead of `score()` to pick competitive docs from the simplified iterator (rough sketch below)
   3. Pushing more scorers into `nonEssentialScorers` in `repartitionLists`  
   
   I think the last two are optimization techniques not mentioned in the paper?
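
   For (2), my understanding of the shape of that check is roughly the following (sketch only; `maxScoreOf(...)` is a stand-in for however the per-clause upper bound is obtained, not an actual method in the branch):

   ```
   // Instead of calling score() on every essential clause positioned on the
   // candidate doc, first sum the cheap per-clause upper bounds; only when this
   // optimistic total clears the minimum competitive score does the two-phase
   // confirmation go on to compute real scores.
   long optimisticScore = nonEssentialMaxScoreSum;
   for (DisiWrapper w = essentialsScorers.topList(); w != null; w = w.next) {
     optimisticScore += maxScoreOf(w); // upper bound only, no score() call
   }
   if (optimisticScore < minCompetitiveScore) {
     // the doc cannot be competitive, so keep advancing without scoring it
   }
   ```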
   
   By the way, please feel free to push directly into this PR branch as well, as I think your changes are in pretty good shape? 




[GitHub] [lucene] mikemccand commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836788286


   > > I also tried to run wikibigall, which seems to require enwiki-20100302-pages-articles-lines.txt, but that file is not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving a 404 now.
   > 
   > Hmm good question, I had downloaded this file years ago, I'm not sure where to find it nowadays. @mikemccand Do you know where to find it? Otherwise I'll upload mine somewhere.
   
   Egads, it is indeed missing!  I will re-upload it.  Hmm, I do not seem to have that exact file locally cached. Confusing ;)  I have the `-1kb` version (medium-sized docs), but not the `big` docs.  @jpountz could you please post it somewhere, maybe your `home.apache.org/~jpountz`, using `sftp`?  Then I'll download it and copy it up to `/~mikemccand`.
   
   The nightly benchmarks use the binary form of `wikibigall`, to reduce the thread bottleneck when reading/parsing documents to index.  Hmm, it is sampled from a different date (01/15/2011) ... OK, I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
   
   BTW there is a [`luceneutil` issue to re-sample the Wikipedia export](https://github.com/mikemccand/luceneutil/issues/91), but, alas, it is snagged because the latest `enwiki` download, after converting XML -> text, is SMALLER than the sample we pulled 11 years ago! I have not yet been able to explain why ...




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-823838918


   luceneutil benchmark results with wikimedium5m:
   
   ```
                      TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  OrHighLow      537.97      (6.2%)      256.35      (2.8%)  -52.3% ( -57% -  -46%) 0.000
                 OrHighHigh       46.17      (2.4%)       22.41      (2.4%)  -51.5% ( -54% -  -47%) 0.000
                  OrHighMed      174.93      (3.3%)       87.04      (2.2%)  -50.2% ( -54% -  -46%) 0.000
                     Fuzzy2       81.84     (15.8%)       59.19      (8.7%)  -27.7% ( -45% -   -3%) 0.000
                     Fuzzy1       92.74      (7.4%)       75.87      (6.5%)  -18.2% ( -29% -   -4%) 0.000
               OrHighNotLow      817.07      (4.2%)      790.10      (6.6%)   -3.3% ( -13% -    7%) 0.060
                    LowTerm     1634.11      (3.3%)     1602.43      (4.3%)   -1.9% (  -9% -    5%) 0.107
                    Prefix3      411.01      (2.4%)      403.38      (2.0%)   -1.9% (  -6% -    2%) 0.008
                    MedTerm     1579.45      (5.0%)     1553.86      (4.7%)   -1.6% ( -10% -    8%) 0.292
              OrHighNotHigh      649.97      (4.4%)      641.91      (6.8%)   -1.2% ( -11% -   10%) 0.496
                 HighPhrase      134.74      (3.0%)      133.12      (2.7%)   -1.2% (  -6% -    4%) 0.184
                   HighTerm     1580.99      (3.7%)     1563.39      (4.8%)   -1.1% (  -9% -    7%) 0.409
               OrNotHighMed      748.48      (5.6%)      740.18      (6.2%)   -1.1% ( -12% -   11%) 0.551
                   Wildcard      128.78      (2.4%)      127.73      (2.2%)   -0.8% (  -5% -    3%) 0.271
               OrNotHighLow     1013.49      (4.1%)     1008.11      (3.5%)   -0.5% (  -7% -    7%) 0.656
                  MedPhrase      232.36      (2.1%)      231.17      (3.3%)   -0.5% (  -5% -    5%) 0.558
                AndHighHigh      100.09      (2.7%)       99.60      (3.2%)   -0.5% (  -6% -    5%) 0.604
                     IntNRQ      118.54      (1.2%)      117.97      (1.8%)   -0.5% (  -3% -    2%) 0.332
                  LowPhrase       82.96      (3.5%)       82.57      (3.1%)   -0.5% (  -6% -    6%) 0.646
            LowSloppyPhrase       68.53      (3.7%)       68.28      (3.2%)   -0.4% (  -7% -    6%) 0.744
       HighTermTitleBDVSort      302.06     (12.3%)      301.05     (15.3%)   -0.3% ( -24% -   31%) 0.939
                 AndHighMed      405.77      (3.2%)      404.65      (3.2%)   -0.3% (  -6% -    6%) 0.784
                   PKLookup      217.46      (2.9%)      217.05      (2.6%)   -0.2% (  -5% -    5%) 0.829
      BrowseMonthSSDVFacets       31.87      (1.7%)       31.81      (1.8%)   -0.2% (  -3% -    3%) 0.752
                    Respell      105.24      (2.4%)      105.14      (2.5%)   -0.1% (  -4% -    4%) 0.904
       HighIntervalsOrdered       27.43      (2.3%)       27.45      (2.5%)    0.1% (  -4% -    5%) 0.902
                MedSpanNear      175.50      (2.3%)      175.75      (2.5%)    0.1% (  -4% -    5%) 0.848
           HighSloppyPhrase       29.61      (4.6%)       29.66      (3.4%)    0.2% (  -7% -    8%) 0.895
              OrNotHighHigh      723.11      (4.7%)      724.68      (4.4%)    0.2% (  -8% -    9%) 0.880
                LowSpanNear      108.75      (1.6%)      109.21      (2.0%)    0.4% (  -3% -    4%) 0.465
            MedSloppyPhrase       90.19      (3.8%)       90.59      (3.0%)    0.4% (  -6% -    7%) 0.682
               HighSpanNear       51.05      (3.7%)       51.31      (3.4%)    0.5% (  -6% -    7%) 0.646
       BrowseDateTaxoFacets       11.05      (3.7%)       11.14      (3.3%)    0.8% (  -5% -    8%) 0.487
   BrowseDayOfYearTaxoFacets       11.06      (3.7%)       11.16      (3.2%)    0.8% (  -5% -    8%) 0.446
      BrowseMonthTaxoFacets       13.14      (3.0%)       13.27      (2.7%)    1.0% (  -4% -    6%) 0.295
   BrowseDayOfYearSSDVFacets       27.88      (5.0%)       28.24      (2.0%)    1.3% (  -5% -    8%) 0.287
      HighTermDayOfYearSort      184.68      (8.9%)      187.15      (9.1%)    1.3% ( -15% -   21%) 0.638
          HighTermMonthSort      243.63     (11.9%)      247.11     (10.6%)    1.4% ( -18% -   27%) 0.690
               OrHighNotMed      722.87      (4.3%)      733.49      (4.5%)    1.5% (  -7% -   10%) 0.293
                 TermDTSort      256.82     (13.0%)      261.88     (14.6%)    2.0% ( -22% -   33%) 0.651
                 AndHighLow     1106.79      (4.4%)     1144.40      (4.1%)    3.4% (  -4% -   12%) 0.012
   ```




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-840255508


   I was able to run wikibigall with the above configuration (plus a few small changes to `localrun.py` to take in the vector dictionary and file), and got the following results:
   
   wikibigall run 1
   ```
                       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
           AndMedOrHighHigh       25.47      (2.5%)       21.21      (4.1%)  -16.7% ( -22% -  -10%) 0.000
                  OrHighMed       72.05      (2.5%)       60.17      (4.1%)  -16.5% ( -22% -  -10%) 0.000
                     Fuzzy2       31.51      (8.4%)       29.27      (9.5%)   -7.1% ( -23% -   11%) 0.012
                     Fuzzy1       59.00      (8.4%)       56.38      (9.8%)   -4.4% ( -20% -   14%) 0.122
          TermDayOfYearSort       48.68     (14.6%)       46.86      (8.9%)   -3.7% ( -23% -   23%) 0.328
              TermMonthSort       61.44      (9.3%)       60.23      (9.8%)   -2.0% ( -19% -   18%) 0.514
              TermTitleSort      110.13      (8.9%)      108.54      (9.2%)   -1.4% ( -18% -   18%) 0.616
                 TermDTSort       56.74      (9.5%)       56.04     (10.2%)   -1.2% ( -19% -   20%) 0.692
               TermGroup10K       15.95      (2.6%)       15.75      (3.2%)   -1.2% (  -6% -    4%) 0.183
               TermGroup100       19.34      (3.5%)       19.11      (3.4%)   -1.2% (  -7% -    5%) 0.276
               TermBGroup1M       17.92      (2.7%)       17.72      (3.8%)   -1.1% (  -7% -    5%) 0.280
                TermGroup1M       19.06      (2.5%)       18.86      (3.1%)   -1.1% (  -6% -    4%) 0.234
             TermBGroup1M1P       43.93      (3.8%)       43.66      (4.4%)   -0.6% (  -8% -    7%) 0.633
                   SpanNear       24.55      (2.5%)       24.46      (2.4%)   -0.4% (  -5% -    4%) 0.620
               SloppyPhrase        2.99      (8.2%)        2.98      (8.0%)   -0.3% ( -15% -   17%) 0.897
                     IntNRQ      139.92      (1.2%)      139.61      (1.7%)   -0.2% (  -3% -    2%) 0.627
           IntervalsOrdered        2.02      (3.7%)        2.02      (3.6%)   -0.1% (  -7% -    7%) 0.957
      BrowseMonthSSDVFacets       19.28      (1.0%)       19.27      (0.9%)   -0.0% (  -1% -    1%) 0.888
   BrowseDayOfYearSSDVFacets       17.61      (1.9%)       17.61      (1.8%)    0.0% (  -3% -    3%) 0.989
                    Respell       48.40      (2.6%)       48.59      (3.1%)    0.4% (  -5% -    6%) 0.660
                   PKLookup      211.17      (3.4%)      212.28      (3.6%)    0.5% (  -6% -    7%) 0.631
   BrowseDayOfYearTaxoFacets        7.30      (6.4%)        7.34      (5.8%)    0.6% ( -10% -   13%) 0.766
                    Prefix3      120.15      (6.1%)      120.95      (4.6%)    0.7% (  -9% -   12%) 0.694
                     Phrase       54.76      (3.6%)       55.13      (4.3%)    0.7% (  -6% -    8%) 0.586
       BrowseDateTaxoFacets        7.62      (6.7%)        7.67      (6.1%)    0.7% ( -11% -   14%) 0.718
                   Wildcard       72.89      (2.8%)       73.46      (2.2%)    0.8% (  -4% -    5%) 0.331
      BrowseMonthTaxoFacets        8.42      (7.0%)        8.49      (6.4%)    0.8% ( -11% -   15%) 0.694
             TermDateFacets        8.56      (7.4%)        8.64      (7.3%)    0.9% ( -12% -   16%) 0.702
               VectorSearch      954.06      (3.7%)      967.28      (2.5%)    1.4% (  -4% -    7%) 0.167
                AndHighHigh       32.04      (3.1%)       32.65      (4.1%)    1.9% (  -5% -    9%) 0.100
                 AndHighMed       83.72      (1.8%)       85.38      (2.9%)    2.0% (  -2% -    6%) 0.009
            AndHighOrMedMed       44.41      (2.2%)       45.34      (4.2%)    2.1% (  -4% -    8%) 0.049
                       Term     1132.83      (5.2%)     1157.74      (5.0%)    2.2% (  -7% -   13%) 0.173
                 OrHighHigh       11.35      (2.9%)       16.95      (4.1%)   49.4% (  41% -   58%) 0.000
   ```
   
   wikibigall run 2
   ```
                       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
          TermDayOfYearSort       59.31     (11.3%)       56.25      (9.5%)   -5.2% ( -23% -   17%) 0.116
              TermTitleSort      175.57     (10.3%)      169.31      (9.3%)   -3.6% ( -21% -   17%) 0.252
              TermMonthSort       61.73     (10.5%)       59.73      (9.2%)   -3.2% ( -20% -   18%) 0.300
               TermBGroup1M       11.87      (3.2%)       11.57      (2.6%)   -2.6% (  -8% -    3%) 0.005
               TermGroup10K       15.60      (2.8%)       15.26      (3.3%)   -2.2% (  -8% -    4%) 0.023
               TermGroup100       23.39      (2.6%)       22.94      (3.0%)   -1.9% (  -7% -    3%) 0.030
                TermGroup1M       12.16      (2.6%)       11.94      (2.9%)   -1.8% (  -7% -    3%) 0.036
                 TermDTSort       78.08      (6.4%)       76.75      (5.4%)   -1.7% ( -12% -   10%) 0.361
                     Fuzzy1       28.49      (5.6%)       28.08      (5.8%)   -1.4% ( -12% -   10%) 0.419
             TermBGroup1M1P       42.98      (2.9%)       42.49      (3.3%)   -1.1% (  -7% -    5%) 0.247
                     IntNRQ      136.00      (3.4%)      134.92      (3.5%)   -0.8% (  -7% -    6%) 0.473
                AndHighHigh       22.60      (3.9%)       22.42      (2.4%)   -0.8% (  -6% -    5%) 0.455
             TermDateFacets        8.38      (7.0%)        8.32      (6.6%)   -0.7% ( -13% -   13%) 0.738
                    Prefix3       33.76      (3.1%)       33.55      (2.8%)   -0.6% (  -6% -    5%) 0.502
                   Wildcard       88.25      (3.0%)       87.72      (2.3%)   -0.6% (  -5% -    4%) 0.479
   BrowseDayOfYearSSDVFacets       17.39      (1.4%)       17.31      (1.5%)   -0.4% (  -3% -    2%) 0.363
                       Term     1144.71      (8.2%)     1139.98      (8.5%)   -0.4% ( -15% -   17%) 0.876
               VectorSearch      807.48      (4.7%)      804.27      (4.9%)   -0.4% (  -9% -    9%) 0.794
           IntervalsOrdered        0.82      (5.3%)        0.82      (5.4%)   -0.3% ( -10% -   11%) 0.872
                   PKLookup      207.38      (4.8%)      206.99      (5.5%)   -0.2% ( -10% -   10%) 0.910
                    Respell       48.27      (3.6%)       48.18      (3.6%)   -0.2% (  -7% -    7%) 0.878
       BrowseDateTaxoFacets        7.58      (6.8%)        7.57      (6.5%)   -0.2% ( -12% -   14%) 0.933
   BrowseDayOfYearTaxoFacets        7.26      (6.3%)        7.24      (6.2%)   -0.2% ( -11% -   13%) 0.932
               SloppyPhrase        5.55      (5.7%)        5.54      (5.8%)   -0.1% ( -10% -   12%) 0.946
      BrowseMonthTaxoFacets        8.38      (6.7%)        8.38      (6.6%)   -0.1% ( -12% -   14%) 0.969
                   SpanNear        2.12      (3.5%)        2.12      (3.9%)    0.0% (  -7% -    7%) 0.993
      BrowseMonthSSDVFacets       19.52      (5.7%)       19.53      (5.7%)    0.0% ( -10% -   12%) 0.985
                     Phrase       59.48      (3.8%)       59.72      (3.9%)    0.4% (  -6% -    8%) 0.735
                 AndHighMed       60.30      (4.6%)       60.57      (2.3%)    0.4% (  -6% -    7%) 0.705
                     Fuzzy2       53.28     (11.3%)       53.93     (10.6%)    1.2% ( -18% -   26%) 0.727
           AndMedOrHighHigh       27.58      (3.1%)       27.93      (3.4%)    1.3% (  -5% -    8%) 0.209
            AndHighOrMedMed       24.38      (3.1%)       25.06      (3.3%)    2.8% (  -3% -    9%) 0.006
                  OrHighMed       45.08      (4.3%)       57.44      (6.9%)   27.4% (  15% -   40%) 0.000
                 OrHighHigh       11.22      (4.8%)       16.16      (9.8%)   44.0% (  28% -   61%) 0.000
   ```
   
   wikibigall run 3
   ```
                       TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                     Fuzzy2       51.26     (10.6%)       47.91     (11.4%)   -6.5% ( -25% -   17%) 0.060
                 TermDTSort       83.68     (11.6%)       79.69      (7.7%)   -4.8% ( -21% -   16%) 0.126
                     Fuzzy1       60.65      (6.6%)       57.96      (8.5%)   -4.4% ( -18% -   11%) 0.065
              TermMonthSort       61.55     (15.5%)       59.70      (8.4%)   -3.0% ( -23% -   24%) 0.447
              TermTitleSort       88.13     (15.1%)       85.68      (8.4%)   -2.8% ( -22% -   24%) 0.471
          TermDayOfYearSort       47.32      (8.7%)       46.58     (10.0%)   -1.5% ( -18% -   18%) 0.602
                   Wildcard       37.71      (9.9%)       37.32      (9.1%)   -1.0% ( -18% -   19%) 0.729
                    Prefix3      164.10     (12.8%)      162.91     (11.5%)   -0.7% ( -22% -   27%) 0.850
               TermGroup100       23.62      (3.0%)       23.50      (2.9%)   -0.5% (  -6% -    5%) 0.617
               TermBGroup1M       11.90      (2.7%)       11.86      (3.1%)   -0.3% (  -5% -    5%) 0.714
                       Term     1044.58      (8.0%)     1041.84      (6.5%)   -0.3% ( -13% -   15%) 0.910
               TermGroup10K       12.18      (2.9%)       12.15      (2.7%)   -0.2% (  -5% -    5%) 0.808
           IntervalsOrdered        3.91      (2.7%)        3.91      (2.7%)   -0.1% (  -5% -    5%) 0.934
   BrowseDayOfYearSSDVFacets       17.47      (2.3%)       17.47      (2.0%)   -0.0% (  -4% -    4%) 0.994
               VectorSearch      851.38      (5.2%)      851.53      (5.6%)    0.0% ( -10% -   11%) 0.992
                   SpanNear       24.59      (2.2%)       24.61      (2.3%)    0.1% (  -4% -    4%) 0.923
                TermGroup1M       15.44      (2.7%)       15.45      (2.9%)    0.1% (  -5% -    5%) 0.936
      BrowseMonthSSDVFacets       19.19      (1.4%)       19.22      (1.5%)    0.2% (  -2% -    3%) 0.746
             TermBGroup1M1P       18.19      (4.7%)       18.24      (4.6%)    0.3% (  -8% -    9%) 0.858
                     IntNRQ      136.29      (3.4%)      136.75      (3.5%)    0.3% (  -6% -    7%) 0.755
                    Respell       40.87      (5.4%)       41.04      (4.8%)    0.4% (  -9% -   11%) 0.793
                     Phrase       24.71      (3.1%)       24.87      (2.7%)    0.7% (  -4% -    6%) 0.472
                   PKLookup      205.99      (5.9%)      208.11      (6.4%)    1.0% ( -10% -   14%) 0.596
   BrowseDayOfYearTaxoFacets        7.04      (7.8%)        7.11      (7.5%)    1.0% ( -13% -   17%) 0.665
       BrowseDateTaxoFacets        7.36      (7.9%)        7.45      (7.7%)    1.1% ( -13% -   18%) 0.646
               SloppyPhrase        1.17     (10.0%)        1.19      (9.8%)    1.2% ( -16% -   23%) 0.705
                 AndHighMed       45.25      (3.5%)       45.82      (3.8%)    1.3% (  -5% -    8%) 0.274
      BrowseMonthTaxoFacets        8.13      (7.9%)        8.25      (7.9%)    1.4% ( -13% -   18%) 0.569
                AndHighHigh       23.27      (3.6%)       23.60      (4.0%)    1.4% (  -5% -    9%) 0.227
             TermDateFacets       11.08      (9.9%)       11.27     (10.0%)    1.7% ( -16% -   23%) 0.582
            AndHighOrMedMed       24.91      (3.8%)       25.72      (4.0%)    3.3% (  -4% -   11%) 0.008
           AndMedOrHighHigh       10.00      (4.2%)       10.61      (4.2%)    6.1% (  -2% -   15%) 0.000
                  OrHighMed       59.17      (5.5%)       67.57      (4.5%)   14.2% (   3% -   25%) 0.000
                 OrHighHigh       17.06      (4.9%)       24.26      (5.2%)   42.2% (  30% -   55%) 0.000
   ```
   
   I'll run it for the other two implementations as well and post the results in the other PR.




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-830494893


   Results from 2 luceneutil runs with wikimedium5m:
   
   ```
                TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  OrHighMed      199.09      (3.5%)      105.27      (2.5%)  -47.1% ( -51% -  -42%) 0.000
                  OrHighLow      536.68      (6.0%)      309.93      (3.2%)  -42.3% ( -48% -  -35%) 0.000
                 OrHighHigh       34.53      (2.3%)       23.21      (3.8%)  -32.8% ( -38% -  -27%) 0.000
                    LowTerm     1830.50      (4.8%)     1784.41      (5.1%)   -2.5% ( -11% -    7%) 0.109
                 HighPhrase      243.73      (2.6%)      237.72      (2.9%)   -2.5% (  -7% -    3%) 0.005
               OrNotHighLow     1251.92      (3.8%)     1236.67      (4.1%)   -1.2% (  -8% -    6%) 0.328
               OrHighNotMed      760.26      (4.4%)      751.82      (4.7%)   -1.1% (  -9% -    8%) 0.443
                   HighTerm     1384.99      (4.2%)     1370.00      (4.2%)   -1.1% (  -9% -    7%) 0.415
                  LowPhrase       80.03      (3.2%)       79.19      (3.5%)   -1.1% (  -7% -    5%) 0.321
                LowSpanNear      194.21      (2.3%)      192.63      (2.4%)   -0.8% (  -5% -    4%) 0.281
              OrNotHighHigh      687.96      (6.5%)      682.76      (4.9%)   -0.8% ( -11% -   11%) 0.679
                  MedPhrase      435.41      (3.1%)      432.56      (3.8%)   -0.7% (  -7% -    6%) 0.554
                MedSpanNear      143.42      (3.0%)      142.58      (2.8%)   -0.6% (  -6% -    5%) 0.520
          HighTermMonthSort      153.33     (13.2%)      152.43     (11.4%)   -0.6% ( -22% -   27%) 0.880
               HighSpanNear       46.53      (2.7%)       46.31      (2.4%)   -0.5% (  -5% -    4%) 0.568
                   PKLookup      222.89      (3.2%)      221.95      (3.3%)   -0.4% (  -6% -    6%) 0.683
                 TermDTSort      210.61     (15.6%)      209.73     (15.1%)   -0.4% ( -26% -   35%) 0.932
      HighTermDayOfYearSort      261.28      (8.7%)      260.58      (8.2%)   -0.3% ( -15% -   18%) 0.921
                    MedTerm     1647.86      (3.4%)     1644.52      (5.7%)   -0.2% (  -8% -    9%) 0.891
            MedSloppyPhrase       30.70      (2.8%)       30.65      (2.4%)   -0.2% (  -5% -    5%) 0.838
      BrowseMonthSSDVFacets       31.90      (2.1%)       31.86      (2.2%)   -0.1% (  -4% -    4%) 0.870
   BrowseDayOfYearTaxoFacets       11.13      (2.9%)       11.13      (2.7%)   -0.0% (  -5% -    5%) 0.958
            LowSloppyPhrase       72.26      (1.9%)       72.26      (1.5%)   -0.0% (  -3% -    3%) 0.985
                 AndHighLow     1093.17      (4.7%)     1093.14      (5.8%)   -0.0% ( -10% -   11%) 0.999
      BrowseMonthTaxoFacets       13.15      (2.4%)       13.15      (2.6%)    0.0% (  -4% -    5%) 0.995
       BrowseDateTaxoFacets       11.10      (2.8%)       11.11      (2.6%)    0.0% (  -5% -    5%) 0.956
                    Prefix3      435.49      (5.6%)      435.75      (4.6%)    0.1% (  -9% -   10%) 0.971
       HighTermTitleBDVSort      349.10     (15.4%)      349.53     (15.9%)    0.1% ( -26% -   37%) 0.980
                     IntNRQ      196.54      (1.2%)      196.83      (1.4%)    0.1% (  -2% -    2%) 0.726
       HighIntervalsOrdered       39.29      (1.7%)       39.35      (1.7%)    0.2% (  -3% -    3%) 0.779
                    Respell       86.26      (2.8%)       86.41      (2.2%)    0.2% (  -4% -    5%) 0.828
               OrNotHighMed      630.88      (5.5%)      632.63      (4.2%)    0.3% (  -9% -   10%) 0.859
                   Wildcard      219.64      (3.3%)      220.47      (3.0%)    0.4% (  -5% -    6%) 0.700
                 AndHighMed      446.48      (3.2%)      448.84      (2.5%)    0.5% (  -5% -    6%) 0.566
           HighSloppyPhrase       48.95      (5.3%)       49.29      (4.3%)    0.7% (  -8% -   10%) 0.653
                     Fuzzy1       77.46      (5.8%)       78.08      (6.6%)    0.8% ( -10% -   13%) 0.683
              OrHighNotHigh      646.82      (5.2%)      653.41      (5.8%)    1.0% (  -9% -   12%) 0.558
   BrowseDayOfYearSSDVFacets       27.71      (6.9%)       28.04      (5.3%)    1.2% ( -10% -   14%) 0.543
                AndHighHigh      153.06      (3.3%)      155.03      (4.2%)    1.3% (  -6% -    9%) 0.283
               OrHighNotLow      783.18      (5.7%)      802.15      (5.5%)    2.4% (  -8% -   14%) 0.173
                     Fuzzy2       82.84     (18.2%)       86.90     (16.6%)    4.9% ( -25% -   48%) 0.373
   ```
   ```
                     TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  OrHighMed      217.30      (3.7%)      124.89      (1.6%)  -42.5% ( -46% -  -38%) 0.000
                 OrHighHigh       66.63      (2.2%)       49.45      (1.5%)  -25.8% ( -28% -  -22%) 0.000
                  OrHighLow      225.66      (2.9%)      210.48      (4.3%)   -6.7% ( -13% -    0%) 0.000
              OrNotHighHigh      828.22      (5.6%)      803.12      (5.7%)   -3.0% ( -13% -    8%) 0.090
              OrHighNotHigh      788.67      (5.7%)      766.95      (5.1%)   -2.8% ( -12% -    8%) 0.107
       HighTermTitleBDVSort      233.10     (10.2%)      227.07      (8.9%)   -2.6% ( -19% -   18%) 0.393
          HighTermMonthSort      399.02     (12.3%)      389.61     (13.3%)   -2.4% ( -24% -   26%) 0.560
                    MedTerm     1607.99      (7.4%)     1570.56      (3.6%)   -2.3% ( -12% -    9%) 0.206
               OrNotHighMed      702.25      (4.6%)      687.85      (5.8%)   -2.1% ( -11% -    8%) 0.213
      HighTermDayOfYearSort      258.38     (14.9%)      253.13     (13.3%)   -2.0% ( -26% -   30%) 0.649
                    Prefix3      499.62      (5.8%)      490.49      (5.9%)   -1.8% ( -12% -   10%) 0.323
                    LowTerm     1675.66      (4.5%)     1645.62      (4.4%)   -1.8% ( -10% -    7%) 0.203
                   Wildcard      148.88      (2.8%)      146.54      (3.8%)   -1.6% (  -7% -    5%) 0.138
               OrHighNotLow      752.35      (5.4%)      741.10      (7.0%)   -1.5% ( -13% -   11%) 0.446
            MedSloppyPhrase       35.95      (3.7%)       35.59      (3.1%)   -1.0% (  -7% -    6%) 0.354
                  MedPhrase      738.72      (5.1%)      731.71      (5.3%)   -0.9% ( -10% -    9%) 0.564
            LowSloppyPhrase       75.04      (3.4%)       74.41      (3.1%)   -0.8% (  -7% -    5%) 0.414
                 TermDTSort      444.12     (14.6%)      440.54     (14.0%)   -0.8% ( -25% -   32%) 0.858
               OrNotHighLow      903.20      (3.9%)      898.10      (4.3%)   -0.6% (  -8% -    7%) 0.662
                   PKLookup      217.06      (3.8%)      216.21      (3.6%)   -0.4% (  -7% -    7%) 0.739
                MedSpanNear      124.58      (3.3%)      124.14      (1.7%)   -0.3% (  -5% -    4%) 0.676
                 HighPhrase       84.33      (2.9%)       84.05      (3.0%)   -0.3% (  -6% -    5%) 0.718
                  LowPhrase       37.84      (3.7%)       37.72      (3.7%)   -0.3% (  -7% -    7%) 0.782
                     Fuzzy2       64.65     (13.7%)       64.47     (11.6%)   -0.3% ( -22% -   29%) 0.943
                     Fuzzy1       57.26      (8.4%)       57.13      (7.4%)   -0.2% ( -14% -   16%) 0.930
                AndHighHigh      126.99      (2.2%)      126.72      (3.0%)   -0.2% (  -5% -    5%) 0.804
       HighIntervalsOrdered       13.60      (2.8%)       13.59      (2.5%)   -0.1% (  -5% -    5%) 0.916
                    Respell      102.67      (2.9%)      102.59      (2.4%)   -0.1% (  -5% -    5%) 0.930
                LowSpanNear       63.07      (3.0%)       63.16      (2.3%)    0.2% (  -4% -    5%) 0.857
           HighSloppyPhrase       91.17      (3.5%)       91.31      (3.3%)    0.2% (  -6% -    7%) 0.887
                 AndHighMed      284.25      (2.0%)      285.44      (2.6%)    0.4% (  -4% -    5%) 0.568
                   HighTerm     1258.74      (4.9%)     1264.45      (3.7%)    0.5% (  -7% -    9%) 0.741
               OrHighNotMed      699.99      (6.1%)      703.29      (7.3%)    0.5% ( -12% -   14%) 0.824
   BrowseDayOfYearTaxoFacets       10.94      (3.8%)       11.00      (3.5%)    0.5% (  -6% -    8%) 0.653
      BrowseMonthTaxoFacets       13.06      (2.4%)       13.13      (2.6%)    0.5% (  -4% -    5%) 0.505
       BrowseDateTaxoFacets       10.92      (3.8%)       10.98      (3.6%)    0.6% (  -6% -    8%) 0.625
                 AndHighLow     1026.60      (4.0%)     1032.70      (4.5%)    0.6% (  -7% -    9%) 0.659
      BrowseMonthSSDVFacets       31.41      (5.4%)       31.63      (2.4%)    0.7% (  -6% -    9%) 0.595
                     IntNRQ      293.66      (3.3%)      295.77      (2.0%)    0.7% (  -4% -    6%) 0.407
               HighSpanNear      147.03      (3.1%)      148.42      (1.8%)    0.9% (  -3% -    6%) 0.243
   BrowseDayOfYearSSDVFacets       27.66      (5.2%)       28.13      (1.7%)    1.7% (  -4% -    9%) 0.163
   ```
   
   




[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836296845


   > I cherry-picked your commit and pushed it to this branch / PR to further explore the changes and their effect; hope that's ok.
   
   Of course!
   
   > I also tried to run wikibigall, which seems to require enwiki-20100302-pages-articles-lines.txt, but that file is not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving a 404 now.
   
   Hmm good question, I had downloaded this file years ago, I'm not sure where to find it nowadays. @mikemccand Do you know where to find it? Otherwise I'll upload mine somewhere.




[GitHub] [lucene] zacharymorn commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r627975325



##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+      throws IOException {
+    super(weight);
+    assert scoreMode == ScoreMode.TOP_SCORES;
+
+    this.scoreMode = scoreMode;
+    this.doc = -1;
+    this.scorers = scorers;
+    this.cost =
+        costWithMinShouldMatch(
+            scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+            scorers.size(),
+            1);
+
+    essentialsScorers = new DisiPriorityQueue(scorers.size());
+    nonEssentialScorers = new LinkedList<>();
+
+    scalingFactor = WANDScorer.getScalingFactor(scorers);
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+    for (Scorer scorer : scorers) {
+      nonEssentialScorers.add(new DisiWrapper(scorer));
+    }
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+          private long lastMinCompetitiveScore;
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            doAdvance(target);
+
+            while (doc != DocIdSetIterator.NO_MORE_DOCS
+                && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+              doAdvance(doc + 1);
+            }
+
+            return doc;
+          }
+
+          private void doAdvance(int target) throws IOException {
+            matchedDocScoreSum = 0;
+            // Find next smallest doc id that is larger than or equal to target from the essential
+            // scorers
+
+            // If the next candidate doc id is still within interval boundary,
+            if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+              while (essentialsScorers.top().doc < target) {
+                DisiWrapper w = essentialsScorers.pop();
+                w.doc = w.iterator.advance(target);
+                essentialsScorers.add(w);
+              }
+
+              if (essentialsScorers.top().doc <= upTo) {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              } else {
+                doc = upTo + 1;
+              }
+            } else {
+              lastMinCompetitiveScore = minCompetitiveScore;
+              // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+              // Find next interval boundary.
+              // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+              // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+              // Find the block interval boundary that is the minimum of all participating scorer's
+              // block boundary. Then run BMM within each interval.
+              updateUpToAndMaxScore(target);
+
+              repartitionLists();
+
+              // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+              // available up to upTo
+              if (essentialsScorers.size() == 0) {
+                // current bound no long valid
+                doc = upTo + 1;
+              } else {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              }
+            }
+
+            for (DisiWrapper w : essentialsScorers) {
+              if (w.doc == doc && doc != NO_MORE_DOCS) {
+                matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+              }
+            }
+          }
+
+          private void updateUpToAndMaxScore(int target) throws IOException {
+            while (essentialsScorers.size() > 0) {
+              nonEssentialScorers.add(essentialsScorers.pop());
+            }
+
+            // reset upTo
+            upTo = DocIdSetIterator.NO_MORE_DOCS;
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc < target) {
+                upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+              } else {
+                upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+              }
+            }
+            assert target <= upTo;
+
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc <= upTo) {
+                w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+              } else {
+                // This scorer won't be able to contribute to match for target, setting its maxScore
+                // to 0 so it goes into nonEssentialList
+                w.maxScore = 0;
+              }
+              if (w.doc < target) {
+                w.doc = w.iterator.advance(target);
+              }

Review comment:
       Hmm, I think the purpose here is similar to https://github.com/jpountz/lucene/blob/ce2e5baed8831aac0f44398b97c54264293beb40/lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java#L117-L120: to make sure the scorer gets positioned on or after `target` when `upTo` is updated. But your approach would require less positioning than mine, since it only updates the scorers in the essential list.
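
       For reference, a minimal sketch of the positioning pattern being discussed, lifted from the diff above (`essentialsScorers` is the `DisiPriorityQueue` of essential scorers; this is not new code, just the relevant piece isolated):

       ```java
       // advance only the essential scorers that are still behind target
       while (essentialsScorers.top().doc < target) {
         DisiWrapper top = essentialsScorers.pop();
         top.doc = top.iterator.advance(target);
         essentialsScorers.add(top);
       }
       // non-essential scorers are left where they are and only advanced
       // lazily, once their scores are needed for a candidate doc
       ```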






[GitHub] [lucene] mikemccand commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836879100


   > FWIW the file I have locally is enwiki-20130102-lines.txt, not the enwiki-20100302-pages-articles-lines.txt file that luceneutil refers to.
   
   Aha!  I have that one locally, compressed with `.lzma` -- I'll upload it.
   
   OK it's copying now to https://home.apache.org/~mikemccand/enwiki-20130102-lines.txt.lzma!  ETA ~15 minutes.




[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-834791219


   > The last two are optimization techniques not mentioned in the paper I think?
   
   To be honest I haven't read the paper recently, so it's possible I diverged a bit from it.
   
   (2) feels natural to avoid doing useless score computations, though it might only work well when score upper bounds are very close to the actual scores. Maybe we should test on wikibig instead of wikimedium to get better confidence that this change makes things better.
   
   Regarding (3), does it actually push more scorers into `nonEssentialScorers`? I thought I had just reorganized the existing logic a bit. If it pushes more scorers into `nonEssentialScorers`, it's probably a bug. :)
   
   > but it seems like a good net speedup given the latter twos already have much faster QPS compared to OrHighHigh?
   
   Agreed, we already made similar choices in the past. It's probably still worth playing with a wider variety of queries, e.g. queries with many terms of mixed frequencies (something like OrHighHighMedMedLowLow), or disjunctions within conjunctions and conjunctions within disjunctions; see the sketch below.
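
   For example, a self-contained sketch of one such query shape (the class name and terms are made up for illustration; in practice the terms would come from the benchmark tasks files):

   ```java
   import java.util.List;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   public class MixedFrequencyDisjunctionExample {
     public static void main(String[] args) {
       // a flat disjunction over terms of mixed frequencies, in the spirit of
       // an OrHighHighMedMedLowLow task
       BooleanQuery.Builder builder = new BooleanQuery.Builder();
       for (String term : List.of("time", "states", "interview", "legal", "zebra", "obelisk")) {
         builder.add(new TermQuery(new Term("body", term)), Occur.SHOULD);
       }
       Query query = builder.build();
       System.out.println(query); // body:time body:states body:interview ...
     }
   }
   ```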
   
   Maybe we can also try to port similar changes to the bulk scorer to see if it yields even greater benefits?




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-837864536


   Thanks @mikemccand and @jpountz for the uploads! 
   
   > The nightly benchmarks uses the binary form of wikibigall, to reduce thread bottleneck when reading/parsing documents to index. Hmm it is sampled from a different date (01/15/2011) ... OK I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
   
   I'm assuming we should now run `nightlyBench.py` instead of `wikibigall`, since the latter depends on `enwiki-20100302-pages-articles-lines.txt`, which is no longer available?
   




[GitHub] [lucene] zacharymorn commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r627978058



##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+      throws IOException {
+    super(weight);
+    assert scoreMode == ScoreMode.TOP_SCORES;
+
+    this.scoreMode = scoreMode;
+    this.doc = -1;
+    this.scorers = scorers;
+    this.cost =
+        costWithMinShouldMatch(
+            scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+            scorers.size(),
+            1);
+
+    essentialsScorers = new DisiPriorityQueue(scorers.size());
+    nonEssentialScorers = new LinkedList<>();
+
+    scalingFactor = WANDScorer.getScalingFactor(scorers);
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+    for (Scorer scorer : scorers) {
+      nonEssentialScorers.add(new DisiWrapper(scorer));
+    }
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+          private long lastMinCompetitiveScore;
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            doAdvance(target);
+
+            while (doc != DocIdSetIterator.NO_MORE_DOCS
+                && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+              doAdvance(doc + 1);
+            }
+
+            return doc;
+          }
+
+          private void doAdvance(int target) throws IOException {
+            matchedDocScoreSum = 0;
+            // Find next smallest doc id that is larger than or equal to target from the essential
+            // scorers
+
+            // If the next candidate doc id is still within interval boundary,
+            if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+              while (essentialsScorers.top().doc < target) {
+                DisiWrapper w = essentialsScorers.pop();
+                w.doc = w.iterator.advance(target);
+                essentialsScorers.add(w);
+              }
+
+              if (essentialsScorers.top().doc <= upTo) {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              } else {
+                doc = upTo + 1;
+              }
+            } else {
+              lastMinCompetitiveScore = minCompetitiveScore;
+              // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+              // Find next interval boundary.
+              // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+              // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+              // Find the block interval boundary that is the minimum of all participating scorer's
+              // block boundary. Then run BMM within each interval.
+              updateUpToAndMaxScore(target);
+
+              repartitionLists();
+
+              // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+              // available up to upTo
+              if (essentialsScorers.size() == 0) {
+                // current bound no long valid
+                doc = upTo + 1;
+              } else {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              }
+            }
+
+            for (DisiWrapper w : essentialsScorers) {
+              if (w.doc == doc && doc != NO_MORE_DOCS) {
+                matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+              }
+            }
+          }
+
+          private void updateUpToAndMaxScore(int target) throws IOException {
+            while (essentialsScorers.size() > 0) {
+              nonEssentialScorers.add(essentialsScorers.pop());
+            }
+
+            // reset upTo
+            upTo = DocIdSetIterator.NO_MORE_DOCS;
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc < target) {
+                upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+              } else {
+                upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+              }
+            }
+            assert target <= upTo;
+
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc <= upTo) {
+                w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+              } else {
+                // This scorer won't be able to contribute to match for target, setting its maxScore
+                // to 0 so it goes into nonEssentialList
+                w.maxScore = 0;
+              }
+              if (w.doc < target) {
+                w.doc = w.iterator.advance(target);
+              }
+            }
+          }
+
+          private void repartitionLists() {
+            Collections.sort(nonEssentialScorers, (w1, w2) -> (int) (w1.maxScore - w2.maxScore));
+
+            // Re-partition the scorers into non-essential list and essential list, as defined in
+            // the "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" paper.
+            nonEssentialMaxScoreSum = 0;
+            for (int i = 0; i < nonEssentialScorers.size(); i++) {
+              DisiWrapper w = nonEssentialScorers.get(i);
+              if (nonEssentialMaxScoreSum + w.maxScore < minCompetitiveScore) {
+                nonEssentialMaxScoreSum += w.maxScore;
+              } else {
+                // the logic is a bit ugly here...but as soon as we find maxScore of scorers in
+                // non-essential list sum to above minCompetitiveScore, we move the rest of
+                // scorers
+                // into essential list
+                for (int j = nonEssentialScorers.size() - 1; j >= i; j--) {
+                  essentialsScorers.add(nonEssentialScorers.remove(j));
+                }
+                break;
+              }
+            }
+          }
+
+          @Override
+          public long cost() {
+            // fixed at initialization
+            return cost;
+          }
+        };
+
+    return new TwoPhaseIterator(approximation) {
+      @Override
+      public boolean matches() throws IOException {
+        // The doc is a match when all scores sum above minCompetitiveScore
+        for (DisiWrapper w : nonEssentialScorers) {
+          if (w.doc < doc) {
+            w.doc = w.iterator.advance(doc);
+          }
+        }
+
+        if (matchedDocScoreSum >= minCompetitiveScore) {
+          return true;
+        }
+
+        for (DisiWrapper w : nonEssentialScorers) {
+          if (w.doc == doc) {
+            matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+
+            if (matchedDocScoreSum >= minCompetitiveScore) {
+              return true;
+            }
+          }
+        }
+
+        return false;
+      }

Review comment:
       Yeah, I think I was following the approach described in the paper too strictly here (the candidate doc was picked based on the sum of maxScore over the non-essential scorers plus the actual scores of the essential scorers positioned on that doc). Implementing it the way you suggest will definitely save a lot of computation.
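
       Spelling out the bound described above, using the fields from the diff (this is just the check written out for clarity, not a proposed change):

       ```java
       // upper bound for the candidate doc = sum of maxScore over the non-essential
       // scorers + actual scores of the essential scorers positioned on this doc
       long upperBound = nonEssentialMaxScoreSum;
       for (DisiWrapper w : essentialsScorers) {
         if (w.doc == doc) {
           upperBound += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
         }
       }
       // the doc is only kept as a candidate when upperBound >= minCompetitiveScore
       boolean competitive = upperBound >= minCompetitiveScore;
       ```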






[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-839419774


   Thanks Adrien for the suggestion! Let me give it a try.




[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836811596


   @mikemccand I'll try to do it overnight as I have a terrible uplink. FWIW the file I have locally is `enwiki-20130102-lines.txt`, not the `enwiki-20100302-pages-articles-lines.txt` file that luceneutil refers to.




[GitHub] [lucene] jpountz commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r627389726



##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+      throws IOException {
+    super(weight);
+    assert scoreMode == ScoreMode.TOP_SCORES;
+
+    this.scoreMode = scoreMode;
+    this.doc = -1;
+    this.scorers = scorers;
+    this.cost =
+        costWithMinShouldMatch(
+            scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+            scorers.size(),
+            1);
+
+    essentialsScorers = new DisiPriorityQueue(scorers.size());
+    nonEssentialScorers = new LinkedList<>();
+
+    scalingFactor = WANDScorer.getScalingFactor(scorers);
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+    for (Scorer scorer : scorers) {
+      nonEssentialScorers.add(new DisiWrapper(scorer));
+    }
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+          private long lastMinCompetitiveScore;
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            doAdvance(target);
+
+            while (doc != DocIdSetIterator.NO_MORE_DOCS
+                && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+              doAdvance(doc + 1);
+            }
+
+            return doc;
+          }
+
+          private void doAdvance(int target) throws IOException {
+            matchedDocScoreSum = 0;
+            // Find next smallest doc id that is larger than or equal to target from the essential
+            // scorers
+
+            // If the next candidate doc id is still within interval boundary,
+            if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+              while (essentialsScorers.top().doc < target) {
+                DisiWrapper w = essentialsScorers.pop();
+                w.doc = w.iterator.advance(target);
+                essentialsScorers.add(w);
+              }
+
+              if (essentialsScorers.top().doc <= upTo) {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              } else {
+                doc = upTo + 1;
+              }
+            } else {
+              lastMinCompetitiveScore = minCompetitiveScore;
+              // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+              // Find next interval boundary.
+              // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+              // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+              // Find the block interval boundary that is the minimum of all participating scorer's
+              // block boundary. Then run BMM within each interval.
+              updateUpToAndMaxScore(target);
+
+              repartitionLists();
+
+              // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+              // available up to upTo
+              if (essentialsScorers.size() == 0) {
+                // current bound no long valid
+                doc = upTo + 1;
+              } else {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              }
+            }
+
+            for (DisiWrapper w : essentialsScorers) {
+              if (w.doc == doc && doc != NO_MORE_DOCS) {
+                matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+              }
+            }
+          }
+
+          private void updateUpToAndMaxScore(int target) throws IOException {
+            while (essentialsScorers.size() > 0) {
+              nonEssentialScorers.add(essentialsScorers.pop());
+            }
+
+            // reset upTo
+            upTo = DocIdSetIterator.NO_MORE_DOCS;
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc < target) {
+                upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+              } else {
+                upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+              }
+            }
+            assert target <= upTo;
+
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc <= upTo) {
+                w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+              } else {
+                // This scorer won't be able to contribute to match for target, setting its maxScore
+                // to 0 so it goes into nonEssentialList
+                w.maxScore = 0;
+              }
+              if (w.doc < target) {
+                w.doc = w.iterator.advance(target);
+              }
+            }
+          }
+
+          private void repartitionLists() {
+            Collections.sort(nonEssentialScorers, (w1, w2) -> (int) (w1.maxScore - w2.maxScore));
+
+            // Re-partition the scorers into non-essential list and essential list, as defined in
+            // the "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" paper.
+            nonEssentialMaxScoreSum = 0;
+            for (int i = 0; i < nonEssentialScorers.size(); i++) {
+              DisiWrapper w = nonEssentialScorers.get(i);
+              if (nonEssentialMaxScoreSum + w.maxScore < minCompetitiveScore) {
+                nonEssentialMaxScoreSum += w.maxScore;
+              } else {
+                // the logic is a bit ugly here...but as soon as we find maxScore of scorers in
+                // non-essential list sum to above minCompetitiveScore, we move the rest of
+                // scorers
+                // into essential list
+                for (int j = nonEssentialScorers.size() - 1; j >= i; j--) {
+                  essentialsScorers.add(nonEssentialScorers.remove(j));
+                }
+                break;
+              }
+            }
+          }
+
+          @Override
+          public long cost() {
+            // fixed at initialization
+            return cost;
+          }
+        };
+
+    return new TwoPhaseIterator(approximation) {
+      @Override
+      public boolean matches() throws IOException {
+        // The doc is a match when all scores sum above minCompetitiveScore
+        for (DisiWrapper w : nonEssentialScorers) {
+          if (w.doc < doc) {
+            w.doc = w.iterator.advance(doc);
+          }
+        }
+
+        if (matchedDocScoreSum >= minCompetitiveScore) {
+          return true;
+        }
+
+        for (DisiWrapper w : nonEssentialScorers) {
+          if (w.doc == doc) {
+            matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+
+            if (matchedDocScoreSum >= minCompetitiveScore) {
+              return true;
+            }
+          }
+        }
+
+        return false;
+      }

Review comment:
       I don't think that implementing a TwoPhaseIterator this way helps, since we're computing scores once here to check whether the hit is competitive, and then again in `score()` if the hit is competitive. Let's return the approximation directly as `iterator()`; hits that don't have a competitive score will naturally be ignored after `score()` is computed, since their scores will be lower than the score at the top of the priority queue?
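
       A minimal sketch of that suggestion (assuming the `approximation` iterator from the diff is kept as a field so `iterator()` can return it; this is not the actual patch):

       ```java
       @Override
       public DocIdSetIterator iterator() {
         // return the approximation directly: hits whose summed score turns out
         // not to be competitive are dropped after score() anyway, because their
         // score stays below the top of the collector's priority queue, so there
         // is no need to compute scores a second time in a matches() phase
         return approximation;
       }
       ```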

##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+  private final ScoreMode scoreMode;
+  private final int scalingFactor;
+
+  // current doc ID of the leads
+  private int doc;
+
+  // doc id boundary that all scorers maxScore are valid
+  private int upTo = -1;
+
+  // heap of scorers ordered by doc ID
+  private final DisiPriorityQueue essentialsScorers;
+
+  // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+  private final List<DisiWrapper> nonEssentialScorers;
+
+  // sum of max scores of scorers in nonEssentialScorers list
+  private long nonEssentialMaxScoreSum;
+
+  // sum of score of scorers in essentialScorers list that are positioned on matching doc
+  private long matchedDocScoreSum;
+
+  private long cost;
+
+  private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+  private final List<Scorer> scorers;
+
+  // scaled min competitive score
+  private long minCompetitiveScore = 0;
+
+  /**
+   * Constructs a Scorer
+   *
+   * @param weight The weight to be used.
+   * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+   * @param scoreMode The scoreMode
+   */
+  public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+      throws IOException {
+    super(weight);
+    assert scoreMode == ScoreMode.TOP_SCORES;
+
+    this.scoreMode = scoreMode;
+    this.doc = -1;
+    this.scorers = scorers;
+    this.cost =
+        costWithMinShouldMatch(
+            scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+            scorers.size(),
+            1);
+
+    essentialsScorers = new DisiPriorityQueue(scorers.size());
+    nonEssentialScorers = new LinkedList<>();
+
+    scalingFactor = WANDScorer.getScalingFactor(scorers);
+    maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+    for (Scorer scorer : scorers) {
+      nonEssentialScorers.add(new DisiWrapper(scorer));
+    }
+  }
+
+  @Override
+  public DocIdSetIterator iterator() {
+    return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+  }
+
+  @Override
+  public TwoPhaseIterator twoPhaseIterator() {
+    DocIdSetIterator approximation =
+        new DocIdSetIterator() {
+          private long lastMinCompetitiveScore;
+
+          @Override
+          public int docID() {
+            return doc;
+          }
+
+          @Override
+          public int nextDoc() throws IOException {
+            return advance(doc + 1);
+          }
+
+          @Override
+          public int advance(int target) throws IOException {
+            doAdvance(target);
+
+            while (doc != DocIdSetIterator.NO_MORE_DOCS
+                && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+              doAdvance(doc + 1);
+            }
+
+            return doc;
+          }
+
+          private void doAdvance(int target) throws IOException {
+            matchedDocScoreSum = 0;
+            // Find next smallest doc id that is larger than or equal to target from the essential
+            // scorers
+
+            // If the next candidate doc id is still within interval boundary,
+            if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+              while (essentialsScorers.top().doc < target) {
+                DisiWrapper w = essentialsScorers.pop();
+                w.doc = w.iterator.advance(target);
+                essentialsScorers.add(w);
+              }
+
+              if (essentialsScorers.top().doc <= upTo) {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              } else {
+                doc = upTo + 1;
+              }
+            } else {
+              lastMinCompetitiveScore = minCompetitiveScore;
+              // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+              // Find next interval boundary.
+              // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+              // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+              // Find the block interval boundary that is the minimum of all participating scorer's
+              // block boundary. Then run BMM within each interval.
+              updateUpToAndMaxScore(target);
+
+              repartitionLists();
+
+              // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+              // available up to upTo
+              if (essentialsScorers.size() == 0) {
+                // current bound no long valid
+                doc = upTo + 1;
+              } else {
+                doc = essentialsScorers.top().doc;
+
+                if (doc == NO_MORE_DOCS) {
+                  return;
+                }
+              }
+            }
+
+            for (DisiWrapper w : essentialsScorers) {
+              if (w.doc == doc && doc != NO_MORE_DOCS) {
+                matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+              }
+            }
+          }
+
+          private void updateUpToAndMaxScore(int target) throws IOException {
+            while (essentialsScorers.size() > 0) {
+              nonEssentialScorers.add(essentialsScorers.pop());
+            }
+
+            // reset upTo
+            upTo = DocIdSetIterator.NO_MORE_DOCS;
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc < target) {
+                upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+              } else {
+                upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+              }
+            }
+            assert target <= upTo;
+
+            for (DisiWrapper w : nonEssentialScorers) {
+              if (w.doc <= upTo) {
+                w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+              } else {
+                // This scorer won't be able to contribute to match for target, setting its maxScore
+                // to 0 so it goes into nonEssentialList
+                w.maxScore = 0;
+              }
+              if (w.doc < target) {
+                w.doc = w.iterator.advance(target);
+              }

Review comment:
       I wouldn't expect that we would have to advance scorers when updating `upTo`?
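
       For illustration, a sketch of what that would look like, reusing names from the diff above (`allScorers` is a stand-in for iterating over both the essential and non-essential lists):

       ```java
       // recomputing upTo only needs advanceShallow(), which looks at block
       // boundaries without moving the underlying postings iterators
       upTo = DocIdSetIterator.NO_MORE_DOCS;
       for (DisiWrapper w : allScorers) {
         upTo = Math.min(w.scorer.advanceShallow(Math.max(w.doc, target)), upTo);
       }
       // w.iterator.advance(...) stays in the main doAdvance() path
       ```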






[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-840934854


   Hi @jpountz, upon further reviewing the results above, I found one issue with this particular scorer implementation (but not with the other BulkScorer-based implementations). Currently this BMM scorer would also be used for nested boolean queries such as `+body:interview +(body:at body:united)` (AndMedOrHighHigh), but in the jira ticket you had suggested using BMM for top-level (flat?) boolean queries only. Do you think this needs to be fixed? I took a quick look at the code, and it seems the query and weight interfaces currently do not provide an API to tell whether they are top-level, so this would probably involve some changes in the `BooleanWeight` code.
   
   On the other hand, from the results above it looks like in 2 out of 3 runs, having the BMM scorer handle nested queries doesn't hurt performance (the `AndMedOrHighHigh` task). So I'm wondering whether it's ok to leave it as is, since BMM may still give a boost for the pure disjunction part of the query? The one run that does show a negative impact on `AndMedOrHighHigh` also shows an impact on `OrHighMed`, so it's a bit strange and may need further investigation to find the cause.
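
   For reference, a self-contained sketch of how that nested shape is built with the BooleanQuery API (the class name is made up; the query mirrors the AndMedOrHighHigh example above):

   ```java
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   public class NestedDisjunctionExample {
     public static void main(String[] args) {
       // the pure disjunction that the BMM scorer would handle
       Query disjunction = new BooleanQuery.Builder()
           .add(new TermQuery(new Term("body", "at")), Occur.SHOULD)
           .add(new TermQuery(new Term("body", "united")), Occur.SHOULD)
           .build();
       // the enclosing conjunction: +body:interview +(body:at body:united)
       Query query = new BooleanQuery.Builder()
           .add(new TermQuery(new Term("body", "interview")), Occur.MUST)
           .add(disjunction, Occur.MUST)
           .build();
       System.out.println(query); // prints: +body:interview +(body:at body:united)
     }
   }
   ```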




[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query

Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841606431


   > > in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed?
   > 
   > I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.
   > 
   > > The one result that does show negative impact to AndMedOrHighHigh also shows impact to OrHighMed, so it’s a bit strange and may need further looking into to see the cause.
   > 
   > Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.
   
   Makes sense! I won't attempt to fix it for now, then.

