You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/04/21 07:21:08 UTC
[GitHub] [lucene] zacharymorn opened a new pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
zacharymorn opened a new pull request #101:
URL: https://github.com/apache/lucene/pull/101
Implement BMM algorithm from "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841124271
> in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed?
I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.
> The one result that does show negative impact to AndMedOrHighHigh also shows impact to OrHighMed, so it’s a bit strange and may need further looking into to see the cause.
Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836880288
Wonderful, thanks @mikemccand !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r625979068
##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+ private final ScoreMode scoreMode;
+ private final int scalingFactor;
+
+ // current doc ID of the leads
+ private int doc;
+
+ // doc id boundary that all scorers maxScore are valid
+ private int upTo = -1;
+
+ // heap of scorers ordered by doc ID
+ private final DisiPriorityQueue essentialsScorers;
+
+ // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+ private final List<DisiWrapper> nonEssentialScorers;
+
+ // sum of max scores of scorers in nonEssentialScorers list
+ private long nonEssentialMaxScoreSum;
+
+ // sum of score of scorers in essentialScorers list that are positioned on matching doc
+ private long matchedDocScoreSum;
+
+ private long cost;
+
+ private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+ private final List<Scorer> scorers;
+
+ // scaled min competitive score
+ private long minCompetitiveScore = 0;
+
+ /**
+ * Constructs a Scorer
+ *
+ * @param weight The weight to be used.
+ * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+ * @param scoreMode The scoreMode
+ */
+ public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+ throws IOException {
+ super(weight);
+ assert scoreMode == ScoreMode.TOP_SCORES;
+
+ this.scoreMode = scoreMode;
+ this.doc = -1;
+ this.scorers = scorers;
+ this.cost =
+ costWithMinShouldMatch(
+ scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+ scorers.size(),
+ 1);
+
+ essentialsScorers = new DisiPriorityQueue(scorers.size());
+ nonEssentialScorers = new LinkedList<>();
+
+ scalingFactor = WANDScorer.getScalingFactor(scorers);
+ maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+ for (Scorer scorer : scorers) {
+ nonEssentialScorers.add(new DisiWrapper(scorer));
+ }
+ }
+
+ @Override
+ public DocIdSetIterator iterator() {
+ return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+ }
+
+ @Override
+ public TwoPhaseIterator twoPhaseIterator() {
+ DocIdSetIterator approximation =
+ new DocIdSetIterator() {
+ private long lastMinCompetitiveScore;
+
+ @Override
+ public int docID() {
+ return doc;
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return advance(doc + 1);
+ }
+
+ @Override
+ public int advance(int target) throws IOException {
+ doAdvance(target);
+
+ while (doc != DocIdSetIterator.NO_MORE_DOCS
+ && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+ doAdvance(doc + 1);
+ }
+
+ return doc;
+ }
+
+ private void doAdvance(int target) throws IOException {
+ matchedDocScoreSum = 0;
+ // Find next smallest doc id that is larger than or equal to target from the essential
+ // scorers
+
+ // If the next candidate doc id is still within interval boundary,
+ if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+ while (essentialsScorers.top().doc < target) {
+ DisiWrapper w = essentialsScorers.pop();
+ w.doc = w.iterator.advance(target);
+ essentialsScorers.add(w);
Review comment:
can you use updateTop instead? It's usually faster than pop+add
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-837880326
> Thanks @mikemccand and @jpountz for the uploads!
>
> > The nightly benchmarks uses the binary form of wikibigall, to reduce thread bottleneck when reading/parsing documents to index. Hmm it is sampled from a different date (01/15/2011) ... OK I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
>
> I'm assuming we should now run `nightlyBench.py` instead of `wikibigall`, since the latter depends on `enwiki-20100302-pages-articles-lines.txt` but it's no longer available?
Actually, I'm still not quite sure how to run it with `enwiki-20110115-lines.bin`, since it's not referred to in luceneutil code I think? I'll dig around in nightly benchmark job tomorrow to find out.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-837909869
Actually you don't need nightlyBench.py, you can use the standard python script. I think the following should work to test out on larger documents:
- Download https://home.apache.org/~mikemccand/enwiki-20130102-lines.txt.lzma and put it under the same directory as the other data files like `enwiki-20120502-lines-1k.txt`.
- Decompress it with `unlzma enwiki-20130102-lines.txt.lzma`.
- Add the following to your `localconstants.py`:
```
WIKI_BIG_DOCS_LINE_FILE = '%s/data/enwiki-20130102-lines.txt' % BASE_DIR
WIKI_BIG_DOCS_COUNT = 6647577
WIKI_BIG_TASKS_FILE = '%s/tasks/wikinightly.tasks' % BENCH_BASE_DIR
```
- Run the benchmark with `-source wikibig10k` to make sure everything works.
- Finally run the benchmark with `-source wikibigall`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-829928067
2 luceneutil runs with wikimedium5m results:
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighMed 196.58 (3.5%) 134.01 (2.5%) -31.8% ( -36% - -26%) 0.000
OrHighHigh 43.32 (2.0%) 31.45 (2.0%) -27.4% ( -30% - -23%) 0.000
Fuzzy1 94.71 (14.2%) 89.31 (16.9%) -5.7% ( -32% - 29%) 0.249
OrHighLow 549.71 (4.7%) 521.55 (5.8%) -5.1% ( -14% - 5%) 0.002
HighTermDayOfYearSort 284.66 (11.4%) 274.13 (11.4%) -3.7% ( -23% - 21%) 0.304
HighTerm 1392.83 (4.6%) 1352.76 (4.7%) -2.9% ( -11% - 6%) 0.050
HighPhrase 199.03 (2.8%) 193.93 (3.0%) -2.6% ( -8% - 3%) 0.006
TermDTSort 200.40 (10.9%) 196.04 (13.2%) -2.2% ( -23% - 24%) 0.570
OrNotHighLow 1137.55 (3.8%) 1114.38 (4.8%) -2.0% ( -10% - 6%) 0.138
OrNotHighMed 874.47 (5.8%) 857.02 (5.3%) -2.0% ( -12% - 9%) 0.259
HighTermMonthSort 285.86 (16.9%) 280.37 (15.4%) -1.9% ( -29% - 36%) 0.707
OrHighNotLow 797.17 (5.2%) 782.04 (5.5%) -1.9% ( -11% - 9%) 0.263
AndHighMed 259.03 (3.2%) 254.45 (4.8%) -1.8% ( -9% - 6%) 0.169
LowTerm 1619.44 (3.4%) 1592.71 (4.4%) -1.7% ( -9% - 6%) 0.184
MedTerm 1448.09 (3.7%) 1432.36 (4.6%) -1.1% ( -9% - 7%) 0.413
LowPhrase 530.26 (3.0%) 524.77 (3.8%) -1.0% ( -7% - 6%) 0.343
Wildcard 237.52 (1.7%) 235.18 (2.1%) -1.0% ( -4% - 2%) 0.104
PKLookup 226.73 (2.8%) 224.67 (2.5%) -0.9% ( -6% - 4%) 0.277
AndHighHigh 167.49 (3.4%) 166.40 (2.7%) -0.6% ( -6% - 5%) 0.505
MedSloppyPhrase 52.92 (1.9%) 52.60 (2.2%) -0.6% ( -4% - 3%) 0.339
MedSpanNear 128.06 (2.0%) 127.34 (2.3%) -0.6% ( -4% - 3%) 0.405
AndHighLow 1016.64 (3.9%) 1011.41 (4.2%) -0.5% ( -8% - 7%) 0.689
Prefix3 333.77 (3.9%) 332.40 (4.0%) -0.4% ( -8% - 7%) 0.742
LowSpanNear 45.55 (2.6%) 45.37 (2.6%) -0.4% ( -5% - 4%) 0.636
HighTermTitleBDVSort 152.08 (15.6%) 151.52 (14.8%) -0.4% ( -26% - 35%) 0.938
HighIntervalsOrdered 23.28 (1.9%) 23.20 (2.4%) -0.4% ( -4% - 3%) 0.597
HighSloppyPhrase 276.19 (2.4%) 275.46 (2.8%) -0.3% ( -5% - 5%) 0.749
BrowseMonthTaxoFacets 13.43 (2.7%) 13.40 (3.0%) -0.2% ( -5% - 5%) 0.827
OrNotHighHigh 730.45 (6.8%) 729.04 (5.1%) -0.2% ( -11% - 12%) 0.919
IntNRQ 137.68 (16.3%) 137.51 (16.6%) -0.1% ( -28% - 39%) 0.981
OrHighNotHigh 629.90 (4.9%) 630.02 (5.0%) 0.0% ( -9% - 10%) 0.990
LowSloppyPhrase 20.78 (1.2%) 20.79 (1.7%) 0.1% ( -2% - 3%) 0.895
Respell 117.77 (2.5%) 117.85 (2.4%) 0.1% ( -4% - 5%) 0.930
BrowseDayOfYearSSDVFacets 28.73 (2.6%) 28.75 (2.6%) 0.1% ( -4% - 5%) 0.930
MedPhrase 51.36 (3.1%) 51.41 (2.1%) 0.1% ( -4% - 5%) 0.902
HighSpanNear 3.58 (2.2%) 3.58 (2.8%) 0.2% ( -4% - 5%) 0.835
Fuzzy2 20.57 (4.2%) 20.66 (3.5%) 0.5% ( -6% - 8%) 0.710
OrHighNotMed 753.94 (6.4%) 758.23 (5.1%) 0.6% ( -10% - 12%) 0.756
BrowseDateTaxoFacets 11.29 (3.1%) 11.35 (3.0%) 0.6% ( -5% - 6%) 0.547
BrowseMonthSSDVFacets 31.97 (6.9%) 32.21 (5.4%) 0.8% ( -10% - 13%) 0.698
BrowseDayOfYearTaxoFacets 11.30 (3.0%) 11.38 (3.1%) 0.8% ( -5% - 7%) 0.427
```
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighHigh 77.68 (3.2%) 36.44 (3.3%) -53.1% ( -57% - -48%) 0.000
OrHighMed 265.36 (4.9%) 148.05 (2.9%) -44.2% ( -49% - -38%) 0.000
OrHighLow 563.62 (6.2%) 392.94 (3.6%) -30.3% ( -37% - -21%) 0.000
OrHighNotHigh 647.72 (6.3%) 625.42 (7.9%) -3.4% ( -16% - 11%) 0.126
IntNRQ 422.98 (3.5%) 410.61 (6.9%) -2.9% ( -12% - 7%) 0.091
OrHighNotLow 840.72 (7.0%) 818.98 (4.4%) -2.6% ( -13% - 9%) 0.162
HighTerm 1464.33 (4.7%) 1428.42 (5.3%) -2.5% ( -11% - 7%) 0.119
MedPhrase 486.70 (4.1%) 478.83 (3.3%) -1.6% ( -8% - 6%) 0.171
OrHighNotMed 738.23 (5.9%) 726.30 (7.4%) -1.6% ( -14% - 12%) 0.444
OrNotHighHigh 713.95 (4.5%) 703.94 (5.2%) -1.4% ( -10% - 8%) 0.358
BrowseDateTaxoFacets 10.78 (3.6%) 10.63 (4.1%) -1.4% ( -8% - 6%) 0.260
MedTerm 1709.93 (4.8%) 1686.84 (5.8%) -1.4% ( -11% - 9%) 0.422
LowTerm 1836.19 (5.4%) 1811.93 (6.6%) -1.3% ( -12% - 11%) 0.491
BrowseDayOfYearTaxoFacets 10.80 (3.8%) 10.66 (4.0%) -1.3% ( -8% - 6%) 0.309
HighPhrase 40.45 (2.7%) 39.99 (2.5%) -1.1% ( -6% - 4%) 0.160
BrowseMonthTaxoFacets 13.11 (2.6%) 12.98 (3.2%) -1.0% ( -6% - 4%) 0.254
HighSloppyPhrase 32.04 (5.6%) 31.70 (4.9%) -1.0% ( -10% - 10%) 0.535
TermDTSort 431.47 (12.5%) 427.83 (13.2%) -0.8% ( -23% - 28%) 0.836
Wildcard 289.98 (7.5%) 287.96 (5.3%) -0.7% ( -12% - 13%) 0.735
PKLookup 217.64 (3.8%) 216.31 (3.2%) -0.6% ( -7% - 6%) 0.580
HighIntervalsOrdered 4.21 (2.6%) 4.20 (3.0%) -0.4% ( -5% - 5%) 0.638
BrowseDayOfYearSSDVFacets 27.75 (2.6%) 27.64 (2.3%) -0.4% ( -5% - 4%) 0.589
AndHighMed 227.13 (2.3%) 226.33 (3.9%) -0.4% ( -6% - 5%) 0.729
AndHighHigh 110.85 (2.4%) 110.49 (3.1%) -0.3% ( -5% - 5%) 0.709
Prefix3 235.98 (3.2%) 235.41 (2.2%) -0.2% ( -5% - 5%) 0.787
LowSpanNear 45.89 (1.7%) 45.78 (1.6%) -0.2% ( -3% - 3%) 0.663
LowSloppyPhrase 24.25 (3.7%) 24.21 (3.5%) -0.2% ( -7% - 7%) 0.867
HighSpanNear 50.67 (3.8%) 50.59 (4.0%) -0.2% ( -7% - 7%) 0.902
LowPhrase 576.08 (4.7%) 575.26 (5.2%) -0.1% ( -9% - 10%) 0.928
OrNotHighMed 692.19 (6.3%) 691.63 (5.1%) -0.1% ( -10% - 12%) 0.965
Fuzzy2 30.66 (4.3%) 30.64 (2.9%) -0.1% ( -7% - 7%) 0.949
MedSpanNear 23.95 (2.0%) 23.95 (2.2%) -0.0% ( -4% - 4%) 0.992
MedSloppyPhrase 36.54 (5.7%) 36.57 (4.2%) 0.1% ( -9% - 10%) 0.962
BrowseMonthSSDVFacets 31.13 (5.1%) 31.20 (4.8%) 0.2% ( -9% - 10%) 0.888
HighTermTitleBDVSort 89.16 (8.9%) 89.37 (8.7%) 0.2% ( -15% - 19%) 0.934
Respell 83.91 (3.1%) 84.11 (2.7%) 0.2% ( -5% - 6%) 0.798
OrNotHighLow 1073.98 (6.5%) 1076.76 (6.1%) 0.3% ( -11% - 13%) 0.897
AndHighLow 905.72 (4.4%) 908.07 (5.2%) 0.3% ( -8% - 10%) 0.864
Fuzzy1 81.24 (12.0%) 81.73 (9.6%) 0.6% ( -18% - 25%) 0.861
HighTermDayOfYearSort 292.09 (15.6%) 294.83 (14.1%) 0.9% ( -24% - 36%) 0.842
HighTermMonthSort 71.72 (11.6%) 74.49 (17.1%) 3.9% ( -22% - 36%) 0.403
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-835707134
I cherry-picked your commit and pushed to this branch / PR to further explore the changes and their effect, hope that's ok.
> (2) feels natural to avoid doing useless score computations, though it might only work well when score upper bounds are very close to the actual scores. Maybe we should test on wikibig instead of wikimedium to get better confidence that this change makes things better.
> Regarding (3) does it actually push more scorers into nonEssentialScorers? I thought I just reorganized the existing logic a bit. If it pushes more scorers into nonEssentialScorers it's probably a bug. :)
I did some explorations to see how these two affect QPS by removing them here https://github.com/apache/lucene/pull/101/commits/2835055979fa6a739972d2abf30888e855b7683c, and restore the optimizations in later commits (the luceneutil results from OrMedMedMedMedMed for these changes are added into git commit messages). From a few runs it seems that these two accounted for about 15% improvement (from -7% to +7%), although for (3) after some more thought I think the different implementations should actually be the same, since the scorers were already sorted by maxScore before they were partitioned. The changes to iterator to avoid unneeded score computation should probably account for the rest of the improvement.
I also tried to run `wikibigall` as well, which seems to require `enwiki-20100302-pages-articles-lines.txt` but it's not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving 404 now. Is there a new place for this archive to live now that I can download from?
> Maybe we can also try to port similar changes to the bulk scorer to see if it yields even greater benefits?
Yes I would love to do that! Will work on that next and try out more different queries.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-834114823
> I played with your branch in order to try to speed things up a bit and I'm getting a bit closer to our current WAND implementation: https://github.com/jpountz/lucene/tree/LUCENE-9335-BMMScorer. Hopefully it's not too buggy. On the latest benchmark I ran it was 15% faster than WANScorer on your OrMedMedMedMedMed task.
Oh wow thanks Adrien for trying it out! It's quite a speed up from -20 ~ 40% to +15%!! I also run it with the full benchmark queries and see it's making + 20% - 90% speed up for `OrHighHigh`, with the cost of -10 ~ 20% slow down for `OrHighLow` and `OrHighMed`, but it seems like a good net speedup given the latter twos already have much faster QPS compared to `OrHighHigh`?
I took a quick look at your implementation and will study it more in the next few days, but right now I can see a few optimizations you applied and suggested earlier that can explain the speed up:
1. The use of faster methods on the priority queue `updateTop`, `topList` etc
2. The use of maxScore instead of `score()` to pick competitive doc from the simplified iterator
3. Pushing more scorers into `nonEssentialScorers` in `repartitionLists`
The last two are optimization techniques not mentioned in the paper I think?
By the way, please feel free to push directly into this PR branch as well, as I think your changes are in pretty good shape?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836788286
> > I also tried to run wikibigall as well, which seems to require enwiki-20100302-pages-articles-lines.txt but it's not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving 404 now.
>
> Hmm good question, I had downloaded this file years ago, I'm not sure where to find it nowadays. @mikemccand Do you know where to find it? Otherwise I'll upload mine somewhere.
Egads, it is indeed missing! I will re-upload it. Hmm, I do not seem to have that exact file locally cached. Confusing ;) I have the `-1kb` version (medium sized docs), but not the `big` docs. @jpountz could you please post somewhere, maybe your `home.apache.org/~jpountz`, using `sftp`? Then I'll download and copy it up to `/~mikemccand`.
The nightly benchmarks uses the binary form of `wikibigall`, to reduce thread bottleneck when reading/parsing documents to index. Hmm it is sampled from a different date (01/15/2011) ... OK I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
BTW there is a [`luceneutil` issue to re-sample the Wikipedia export](https://github.com/mikemccand/luceneutil/issues/91), but, alas, it is snagged up because the latest `enwiki` download, after converting XML -> text, is SMALLER than the sample we pulled 11 years ago! Which I could not yet explain ...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-823838918
luceneutil benchmark result with wikimedium5m
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighLow 537.97 (6.2%) 256.35 (2.8%) -52.3% ( -57% - -46%) 0.000
OrHighHigh 46.17 (2.4%) 22.41 (2.4%) -51.5% ( -54% - -47%) 0.000
OrHighMed 174.93 (3.3%) 87.04 (2.2%) -50.2% ( -54% - -46%) 0.000
Fuzzy2 81.84 (15.8%) 59.19 (8.7%) -27.7% ( -45% - -3%) 0.000
Fuzzy1 92.74 (7.4%) 75.87 (6.5%) -18.2% ( -29% - -4%) 0.000
OrHighNotLow 817.07 (4.2%) 790.10 (6.6%) -3.3% ( -13% - 7%) 0.060
LowTerm 1634.11 (3.3%) 1602.43 (4.3%) -1.9% ( -9% - 5%) 0.107
Prefix3 411.01 (2.4%) 403.38 (2.0%) -1.9% ( -6% - 2%) 0.008
MedTerm 1579.45 (5.0%) 1553.86 (4.7%) -1.6% ( -10% - 8%) 0.292
OrHighNotHigh 649.97 (4.4%) 641.91 (6.8%) -1.2% ( -11% - 10%) 0.496
HighPhrase 134.74 (3.0%) 133.12 (2.7%) -1.2% ( -6% - 4%) 0.184
HighTerm 1580.99 (3.7%) 1563.39 (4.8%) -1.1% ( -9% - 7%) 0.409
OrNotHighMed 748.48 (5.6%) 740.18 (6.2%) -1.1% ( -12% - 11%) 0.551
Wildcard 128.78 (2.4%) 127.73 (2.2%) -0.8% ( -5% - 3%) 0.271
OrNotHighLow 1013.49 (4.1%) 1008.11 (3.5%) -0.5% ( -7% - 7%) 0.656
MedPhrase 232.36 (2.1%) 231.17 (3.3%) -0.5% ( -5% - 5%) 0.558
AndHighHigh 100.09 (2.7%) 99.60 (3.2%) -0.5% ( -6% - 5%) 0.604
IntNRQ 118.54 (1.2%) 117.97 (1.8%) -0.5% ( -3% - 2%) 0.332
LowPhrase 82.96 (3.5%) 82.57 (3.1%) -0.5% ( -6% - 6%) 0.646
LowSloppyPhrase 68.53 (3.7%) 68.28 (3.2%) -0.4% ( -7% - 6%) 0.744
HighTermTitleBDVSort 302.06 (12.3%) 301.05 (15.3%) -0.3% ( -24% - 31%) 0.939
AndHighMed 405.77 (3.2%) 404.65 (3.2%) -0.3% ( -6% - 6%) 0.784
PKLookup 217.46 (2.9%) 217.05 (2.6%) -0.2% ( -5% - 5%) 0.829
BrowseMonthSSDVFacets 31.87 (1.7%) 31.81 (1.8%) -0.2% ( -3% - 3%) 0.752
Respell 105.24 (2.4%) 105.14 (2.5%) -0.1% ( -4% - 4%) 0.904
HighIntervalsOrdered 27.43 (2.3%) 27.45 (2.5%) 0.1% ( -4% - 5%) 0.902
MedSpanNear 175.50 (2.3%) 175.75 (2.5%) 0.1% ( -4% - 5%) 0.848
HighSloppyPhrase 29.61 (4.6%) 29.66 (3.4%) 0.2% ( -7% - 8%) 0.895
OrNotHighHigh 723.11 (4.7%) 724.68 (4.4%) 0.2% ( -8% - 9%) 0.880
LowSpanNear 108.75 (1.6%) 109.21 (2.0%) 0.4% ( -3% - 4%) 0.465
MedSloppyPhrase 90.19 (3.8%) 90.59 (3.0%) 0.4% ( -6% - 7%) 0.682
HighSpanNear 51.05 (3.7%) 51.31 (3.4%) 0.5% ( -6% - 7%) 0.646
BrowseDateTaxoFacets 11.05 (3.7%) 11.14 (3.3%) 0.8% ( -5% - 8%) 0.487
BrowseDayOfYearTaxoFacets 11.06 (3.7%) 11.16 (3.2%) 0.8% ( -5% - 8%) 0.446
BrowseMonthTaxoFacets 13.14 (3.0%) 13.27 (2.7%) 1.0% ( -4% - 6%) 0.295
BrowseDayOfYearSSDVFacets 27.88 (5.0%) 28.24 (2.0%) 1.3% ( -5% - 8%) 0.287
HighTermDayOfYearSort 184.68 (8.9%) 187.15 (9.1%) 1.3% ( -15% - 21%) 0.638
HighTermMonthSort 243.63 (11.9%) 247.11 (10.6%) 1.4% ( -18% - 27%) 0.690
OrHighNotMed 722.87 (4.3%) 733.49 (4.5%) 1.5% ( -7% - 10%) 0.293
TermDTSort 256.82 (13.0%) 261.88 (14.6%) 2.0% ( -22% - 33%) 0.651
AndHighLow 1106.79 (4.4%) 1144.40 (4.1%) 3.4% ( -4% - 12%) 0.012
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-840255508
I'm able to run wikibigall with the above configuration (and a few small changes to `localrun.py` to take in vector dictionary and file), and get the following results:
wikibigall run 1
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndMedOrHighHigh 25.47 (2.5%) 21.21 (4.1%) -16.7% ( -22% - -10%) 0.000
OrHighMed 72.05 (2.5%) 60.17 (4.1%) -16.5% ( -22% - -10%) 0.000
Fuzzy2 31.51 (8.4%) 29.27 (9.5%) -7.1% ( -23% - 11%) 0.012
Fuzzy1 59.00 (8.4%) 56.38 (9.8%) -4.4% ( -20% - 14%) 0.122
TermDayOfYearSort 48.68 (14.6%) 46.86 (8.9%) -3.7% ( -23% - 23%) 0.328
TermMonthSort 61.44 (9.3%) 60.23 (9.8%) -2.0% ( -19% - 18%) 0.514
TermTitleSort 110.13 (8.9%) 108.54 (9.2%) -1.4% ( -18% - 18%) 0.616
TermDTSort 56.74 (9.5%) 56.04 (10.2%) -1.2% ( -19% - 20%) 0.692
TermGroup10K 15.95 (2.6%) 15.75 (3.2%) -1.2% ( -6% - 4%) 0.183
TermGroup100 19.34 (3.5%) 19.11 (3.4%) -1.2% ( -7% - 5%) 0.276
TermBGroup1M 17.92 (2.7%) 17.72 (3.8%) -1.1% ( -7% - 5%) 0.280
TermGroup1M 19.06 (2.5%) 18.86 (3.1%) -1.1% ( -6% - 4%) 0.234
TermBGroup1M1P 43.93 (3.8%) 43.66 (4.4%) -0.6% ( -8% - 7%) 0.633
SpanNear 24.55 (2.5%) 24.46 (2.4%) -0.4% ( -5% - 4%) 0.620
SloppyPhrase 2.99 (8.2%) 2.98 (8.0%) -0.3% ( -15% - 17%) 0.897
IntNRQ 139.92 (1.2%) 139.61 (1.7%) -0.2% ( -3% - 2%) 0.627
IntervalsOrdered 2.02 (3.7%) 2.02 (3.6%) -0.1% ( -7% - 7%) 0.957
BrowseMonthSSDVFacets 19.28 (1.0%) 19.27 (0.9%) -0.0% ( -1% - 1%) 0.888
BrowseDayOfYearSSDVFacets 17.61 (1.9%) 17.61 (1.8%) 0.0% ( -3% - 3%) 0.989
Respell 48.40 (2.6%) 48.59 (3.1%) 0.4% ( -5% - 6%) 0.660
PKLookup 211.17 (3.4%) 212.28 (3.6%) 0.5% ( -6% - 7%) 0.631
BrowseDayOfYearTaxoFacets 7.30 (6.4%) 7.34 (5.8%) 0.6% ( -10% - 13%) 0.766
Prefix3 120.15 (6.1%) 120.95 (4.6%) 0.7% ( -9% - 12%) 0.694
Phrase 54.76 (3.6%) 55.13 (4.3%) 0.7% ( -6% - 8%) 0.586
BrowseDateTaxoFacets 7.62 (6.7%) 7.67 (6.1%) 0.7% ( -11% - 14%) 0.718
Wildcard 72.89 (2.8%) 73.46 (2.2%) 0.8% ( -4% - 5%) 0.331
BrowseMonthTaxoFacets 8.42 (7.0%) 8.49 (6.4%) 0.8% ( -11% - 15%) 0.694
TermDateFacets 8.56 (7.4%) 8.64 (7.3%) 0.9% ( -12% - 16%) 0.702
VectorSearch 954.06 (3.7%) 967.28 (2.5%) 1.4% ( -4% - 7%) 0.167
AndHighHigh 32.04 (3.1%) 32.65 (4.1%) 1.9% ( -5% - 9%) 0.100
AndHighMed 83.72 (1.8%) 85.38 (2.9%) 2.0% ( -2% - 6%) 0.009
AndHighOrMedMed 44.41 (2.2%) 45.34 (4.2%) 2.1% ( -4% - 8%) 0.049
Term 1132.83 (5.2%) 1157.74 (5.0%) 2.2% ( -7% - 13%) 0.173
OrHighHigh 11.35 (2.9%) 16.95 (4.1%) 49.4% ( 41% - 58%) 0.000
```
wikibigall run 2
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
TermDayOfYearSort 59.31 (11.3%) 56.25 (9.5%) -5.2% ( -23% - 17%) 0.116
TermTitleSort 175.57 (10.3%) 169.31 (9.3%) -3.6% ( -21% - 17%) 0.252
TermMonthSort 61.73 (10.5%) 59.73 (9.2%) -3.2% ( -20% - 18%) 0.300
TermBGroup1M 11.87 (3.2%) 11.57 (2.6%) -2.6% ( -8% - 3%) 0.005
TermGroup10K 15.60 (2.8%) 15.26 (3.3%) -2.2% ( -8% - 4%) 0.023
TermGroup100 23.39 (2.6%) 22.94 (3.0%) -1.9% ( -7% - 3%) 0.030
TermGroup1M 12.16 (2.6%) 11.94 (2.9%) -1.8% ( -7% - 3%) 0.036
TermDTSort 78.08 (6.4%) 76.75 (5.4%) -1.7% ( -12% - 10%) 0.361
Fuzzy1 28.49 (5.6%) 28.08 (5.8%) -1.4% ( -12% - 10%) 0.419
TermBGroup1M1P 42.98 (2.9%) 42.49 (3.3%) -1.1% ( -7% - 5%) 0.247
IntNRQ 136.00 (3.4%) 134.92 (3.5%) -0.8% ( -7% - 6%) 0.473
AndHighHigh 22.60 (3.9%) 22.42 (2.4%) -0.8% ( -6% - 5%) 0.455
TermDateFacets 8.38 (7.0%) 8.32 (6.6%) -0.7% ( -13% - 13%) 0.738
Prefix3 33.76 (3.1%) 33.55 (2.8%) -0.6% ( -6% - 5%) 0.502
Wildcard 88.25 (3.0%) 87.72 (2.3%) -0.6% ( -5% - 4%) 0.479
BrowseDayOfYearSSDVFacets 17.39 (1.4%) 17.31 (1.5%) -0.4% ( -3% - 2%) 0.363
Term 1144.71 (8.2%) 1139.98 (8.5%) -0.4% ( -15% - 17%) 0.876
VectorSearch 807.48 (4.7%) 804.27 (4.9%) -0.4% ( -9% - 9%) 0.794
IntervalsOrdered 0.82 (5.3%) 0.82 (5.4%) -0.3% ( -10% - 11%) 0.872
PKLookup 207.38 (4.8%) 206.99 (5.5%) -0.2% ( -10% - 10%) 0.910
Respell 48.27 (3.6%) 48.18 (3.6%) -0.2% ( -7% - 7%) 0.878
BrowseDateTaxoFacets 7.58 (6.8%) 7.57 (6.5%) -0.2% ( -12% - 14%) 0.933
BrowseDayOfYearTaxoFacets 7.26 (6.3%) 7.24 (6.2%) -0.2% ( -11% - 13%) 0.932
SloppyPhrase 5.55 (5.7%) 5.54 (5.8%) -0.1% ( -10% - 12%) 0.946
BrowseMonthTaxoFacets 8.38 (6.7%) 8.38 (6.6%) -0.1% ( -12% - 14%) 0.969
SpanNear 2.12 (3.5%) 2.12 (3.9%) 0.0% ( -7% - 7%) 0.993
BrowseMonthSSDVFacets 19.52 (5.7%) 19.53 (5.7%) 0.0% ( -10% - 12%) 0.985
Phrase 59.48 (3.8%) 59.72 (3.9%) 0.4% ( -6% - 8%) 0.735
AndHighMed 60.30 (4.6%) 60.57 (2.3%) 0.4% ( -6% - 7%) 0.705
Fuzzy2 53.28 (11.3%) 53.93 (10.6%) 1.2% ( -18% - 26%) 0.727
AndMedOrHighHigh 27.58 (3.1%) 27.93 (3.4%) 1.3% ( -5% - 8%) 0.209
AndHighOrMedMed 24.38 (3.1%) 25.06 (3.3%) 2.8% ( -3% - 9%) 0.006
OrHighMed 45.08 (4.3%) 57.44 (6.9%) 27.4% ( 15% - 40%) 0.000
OrHighHigh 11.22 (4.8%) 16.16 (9.8%) 44.0% ( 28% - 61%) 0.000
```
wikibigall run 3
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 51.26 (10.6%) 47.91 (11.4%) -6.5% ( -25% - 17%) 0.060
TermDTSort 83.68 (11.6%) 79.69 (7.7%) -4.8% ( -21% - 16%) 0.126
Fuzzy1 60.65 (6.6%) 57.96 (8.5%) -4.4% ( -18% - 11%) 0.065
TermMonthSort 61.55 (15.5%) 59.70 (8.4%) -3.0% ( -23% - 24%) 0.447
TermTitleSort 88.13 (15.1%) 85.68 (8.4%) -2.8% ( -22% - 24%) 0.471
TermDayOfYearSort 47.32 (8.7%) 46.58 (10.0%) -1.5% ( -18% - 18%) 0.602
Wildcard 37.71 (9.9%) 37.32 (9.1%) -1.0% ( -18% - 19%) 0.729
Prefix3 164.10 (12.8%) 162.91 (11.5%) -0.7% ( -22% - 27%) 0.850
TermGroup100 23.62 (3.0%) 23.50 (2.9%) -0.5% ( -6% - 5%) 0.617
TermBGroup1M 11.90 (2.7%) 11.86 (3.1%) -0.3% ( -5% - 5%) 0.714
Term 1044.58 (8.0%) 1041.84 (6.5%) -0.3% ( -13% - 15%) 0.910
TermGroup10K 12.18 (2.9%) 12.15 (2.7%) -0.2% ( -5% - 5%) 0.808
IntervalsOrdered 3.91 (2.7%) 3.91 (2.7%) -0.1% ( -5% - 5%) 0.934
BrowseDayOfYearSSDVFacets 17.47 (2.3%) 17.47 (2.0%) -0.0% ( -4% - 4%) 0.994
VectorSearch 851.38 (5.2%) 851.53 (5.6%) 0.0% ( -10% - 11%) 0.992
SpanNear 24.59 (2.2%) 24.61 (2.3%) 0.1% ( -4% - 4%) 0.923
TermGroup1M 15.44 (2.7%) 15.45 (2.9%) 0.1% ( -5% - 5%) 0.936
BrowseMonthSSDVFacets 19.19 (1.4%) 19.22 (1.5%) 0.2% ( -2% - 3%) 0.746
TermBGroup1M1P 18.19 (4.7%) 18.24 (4.6%) 0.3% ( -8% - 9%) 0.858
IntNRQ 136.29 (3.4%) 136.75 (3.5%) 0.3% ( -6% - 7%) 0.755
Respell 40.87 (5.4%) 41.04 (4.8%) 0.4% ( -9% - 11%) 0.793
Phrase 24.71 (3.1%) 24.87 (2.7%) 0.7% ( -4% - 6%) 0.472
PKLookup 205.99 (5.9%) 208.11 (6.4%) 1.0% ( -10% - 14%) 0.596
BrowseDayOfYearTaxoFacets 7.04 (7.8%) 7.11 (7.5%) 1.0% ( -13% - 17%) 0.665
BrowseDateTaxoFacets 7.36 (7.9%) 7.45 (7.7%) 1.1% ( -13% - 18%) 0.646
SloppyPhrase 1.17 (10.0%) 1.19 (9.8%) 1.2% ( -16% - 23%) 0.705
AndHighMed 45.25 (3.5%) 45.82 (3.8%) 1.3% ( -5% - 8%) 0.274
BrowseMonthTaxoFacets 8.13 (7.9%) 8.25 (7.9%) 1.4% ( -13% - 18%) 0.569
AndHighHigh 23.27 (3.6%) 23.60 (4.0%) 1.4% ( -5% - 9%) 0.227
TermDateFacets 11.08 (9.9%) 11.27 (10.0%) 1.7% ( -16% - 23%) 0.582
AndHighOrMedMed 24.91 (3.8%) 25.72 (4.0%) 3.3% ( -4% - 11%) 0.008
AndMedOrHighHigh 10.00 (4.2%) 10.61 (4.2%) 6.1% ( -2% - 15%) 0.000
OrHighMed 59.17 (5.5%) 67.57 (4.5%) 14.2% ( 3% - 25%) 0.000
OrHighHigh 17.06 (4.9%) 24.26 (5.2%) 42.2% ( 30% - 55%) 0.000
```
I'll run it for the other two implementations as well and post the results in the other PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-830494893
2 luceneutil runs with wikimedium5m results:
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighMed 199.09 (3.5%) 105.27 (2.5%) -47.1% ( -51% - -42%) 0.000
OrHighLow 536.68 (6.0%) 309.93 (3.2%) -42.3% ( -48% - -35%) 0.000
OrHighHigh 34.53 (2.3%) 23.21 (3.8%) -32.8% ( -38% - -27%) 0.000
LowTerm 1830.50 (4.8%) 1784.41 (5.1%) -2.5% ( -11% - 7%) 0.109
HighPhrase 243.73 (2.6%) 237.72 (2.9%) -2.5% ( -7% - 3%) 0.005
OrNotHighLow 1251.92 (3.8%) 1236.67 (4.1%) -1.2% ( -8% - 6%) 0.328
OrHighNotMed 760.26 (4.4%) 751.82 (4.7%) -1.1% ( -9% - 8%) 0.443
HighTerm 1384.99 (4.2%) 1370.00 (4.2%) -1.1% ( -9% - 7%) 0.415
LowPhrase 80.03 (3.2%) 79.19 (3.5%) -1.1% ( -7% - 5%) 0.321
LowSpanNear 194.21 (2.3%) 192.63 (2.4%) -0.8% ( -5% - 4%) 0.281
OrNotHighHigh 687.96 (6.5%) 682.76 (4.9%) -0.8% ( -11% - 11%) 0.679
MedPhrase 435.41 (3.1%) 432.56 (3.8%) -0.7% ( -7% - 6%) 0.554
MedSpanNear 143.42 (3.0%) 142.58 (2.8%) -0.6% ( -6% - 5%) 0.520
HighTermMonthSort 153.33 (13.2%) 152.43 (11.4%) -0.6% ( -22% - 27%) 0.880
HighSpanNear 46.53 (2.7%) 46.31 (2.4%) -0.5% ( -5% - 4%) 0.568
PKLookup 222.89 (3.2%) 221.95 (3.3%) -0.4% ( -6% - 6%) 0.683
TermDTSort 210.61 (15.6%) 209.73 (15.1%) -0.4% ( -26% - 35%) 0.932
HighTermDayOfYearSort 261.28 (8.7%) 260.58 (8.2%) -0.3% ( -15% - 18%) 0.921
MedTerm 1647.86 (3.4%) 1644.52 (5.7%) -0.2% ( -8% - 9%) 0.891
MedSloppyPhrase 30.70 (2.8%) 30.65 (2.4%) -0.2% ( -5% - 5%) 0.838
BrowseMonthSSDVFacets 31.90 (2.1%) 31.86 (2.2%) -0.1% ( -4% - 4%) 0.870
BrowseDayOfYearTaxoFacets 11.13 (2.9%) 11.13 (2.7%) -0.0% ( -5% - 5%) 0.958
LowSloppyPhrase 72.26 (1.9%) 72.26 (1.5%) -0.0% ( -3% - 3%) 0.985
AndHighLow 1093.17 (4.7%) 1093.14 (5.8%) -0.0% ( -10% - 11%) 0.999
BrowseMonthTaxoFacets 13.15 (2.4%) 13.15 (2.6%) 0.0% ( -4% - 5%) 0.995
BrowseDateTaxoFacets 11.10 (2.8%) 11.11 (2.6%) 0.0% ( -5% - 5%) 0.956
Prefix3 435.49 (5.6%) 435.75 (4.6%) 0.1% ( -9% - 10%) 0.971
HighTermTitleBDVSort 349.10 (15.4%) 349.53 (15.9%) 0.1% ( -26% - 37%) 0.980
IntNRQ 196.54 (1.2%) 196.83 (1.4%) 0.1% ( -2% - 2%) 0.726
HighIntervalsOrdered 39.29 (1.7%) 39.35 (1.7%) 0.2% ( -3% - 3%) 0.779
Respell 86.26 (2.8%) 86.41 (2.2%) 0.2% ( -4% - 5%) 0.828
OrNotHighMed 630.88 (5.5%) 632.63 (4.2%) 0.3% ( -9% - 10%) 0.859
Wildcard 219.64 (3.3%) 220.47 (3.0%) 0.4% ( -5% - 6%) 0.700
AndHighMed 446.48 (3.2%) 448.84 (2.5%) 0.5% ( -5% - 6%) 0.566
HighSloppyPhrase 48.95 (5.3%) 49.29 (4.3%) 0.7% ( -8% - 10%) 0.653
Fuzzy1 77.46 (5.8%) 78.08 (6.6%) 0.8% ( -10% - 13%) 0.683
OrHighNotHigh 646.82 (5.2%) 653.41 (5.8%) 1.0% ( -9% - 12%) 0.558
BrowseDayOfYearSSDVFacets 27.71 (6.9%) 28.04 (5.3%) 1.2% ( -10% - 14%) 0.543
AndHighHigh 153.06 (3.3%) 155.03 (4.2%) 1.3% ( -6% - 9%) 0.283
OrHighNotLow 783.18 (5.7%) 802.15 (5.5%) 2.4% ( -8% - 14%) 0.173
Fuzzy2 82.84 (18.2%) 86.90 (16.6%) 4.9% ( -25% - 48%) 0.373
```
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighMed 217.30 (3.7%) 124.89 (1.6%) -42.5% ( -46% - -38%) 0.000
OrHighHigh 66.63 (2.2%) 49.45 (1.5%) -25.8% ( -28% - -22%) 0.000
OrHighLow 225.66 (2.9%) 210.48 (4.3%) -6.7% ( -13% - 0%) 0.000
OrNotHighHigh 828.22 (5.6%) 803.12 (5.7%) -3.0% ( -13% - 8%) 0.090
OrHighNotHigh 788.67 (5.7%) 766.95 (5.1%) -2.8% ( -12% - 8%) 0.107
HighTermTitleBDVSort 233.10 (10.2%) 227.07 (8.9%) -2.6% ( -19% - 18%) 0.393
HighTermMonthSort 399.02 (12.3%) 389.61 (13.3%) -2.4% ( -24% - 26%) 0.560
MedTerm 1607.99 (7.4%) 1570.56 (3.6%) -2.3% ( -12% - 9%) 0.206
OrNotHighMed 702.25 (4.6%) 687.85 (5.8%) -2.1% ( -11% - 8%) 0.213
HighTermDayOfYearSort 258.38 (14.9%) 253.13 (13.3%) -2.0% ( -26% - 30%) 0.649
Prefix3 499.62 (5.8%) 490.49 (5.9%) -1.8% ( -12% - 10%) 0.323
LowTerm 1675.66 (4.5%) 1645.62 (4.4%) -1.8% ( -10% - 7%) 0.203
Wildcard 148.88 (2.8%) 146.54 (3.8%) -1.6% ( -7% - 5%) 0.138
OrHighNotLow 752.35 (5.4%) 741.10 (7.0%) -1.5% ( -13% - 11%) 0.446
MedSloppyPhrase 35.95 (3.7%) 35.59 (3.1%) -1.0% ( -7% - 6%) 0.354
MedPhrase 738.72 (5.1%) 731.71 (5.3%) -0.9% ( -10% - 9%) 0.564
LowSloppyPhrase 75.04 (3.4%) 74.41 (3.1%) -0.8% ( -7% - 5%) 0.414
TermDTSort 444.12 (14.6%) 440.54 (14.0%) -0.8% ( -25% - 32%) 0.858
OrNotHighLow 903.20 (3.9%) 898.10 (4.3%) -0.6% ( -8% - 7%) 0.662
PKLookup 217.06 (3.8%) 216.21 (3.6%) -0.4% ( -7% - 7%) 0.739
MedSpanNear 124.58 (3.3%) 124.14 (1.7%) -0.3% ( -5% - 4%) 0.676
HighPhrase 84.33 (2.9%) 84.05 (3.0%) -0.3% ( -6% - 5%) 0.718
LowPhrase 37.84 (3.7%) 37.72 (3.7%) -0.3% ( -7% - 7%) 0.782
Fuzzy2 64.65 (13.7%) 64.47 (11.6%) -0.3% ( -22% - 29%) 0.943
Fuzzy1 57.26 (8.4%) 57.13 (7.4%) -0.2% ( -14% - 16%) 0.930
AndHighHigh 126.99 (2.2%) 126.72 (3.0%) -0.2% ( -5% - 5%) 0.804
HighIntervalsOrdered 13.60 (2.8%) 13.59 (2.5%) -0.1% ( -5% - 5%) 0.916
Respell 102.67 (2.9%) 102.59 (2.4%) -0.1% ( -5% - 5%) 0.930
LowSpanNear 63.07 (3.0%) 63.16 (2.3%) 0.2% ( -4% - 5%) 0.857
HighSloppyPhrase 91.17 (3.5%) 91.31 (3.3%) 0.2% ( -6% - 7%) 0.887
AndHighMed 284.25 (2.0%) 285.44 (2.6%) 0.4% ( -4% - 5%) 0.568
HighTerm 1258.74 (4.9%) 1264.45 (3.7%) 0.5% ( -7% - 9%) 0.741
OrHighNotMed 699.99 (6.1%) 703.29 (7.3%) 0.5% ( -12% - 14%) 0.824
BrowseDayOfYearTaxoFacets 10.94 (3.8%) 11.00 (3.5%) 0.5% ( -6% - 8%) 0.653
BrowseMonthTaxoFacets 13.06 (2.4%) 13.13 (2.6%) 0.5% ( -4% - 5%) 0.505
BrowseDateTaxoFacets 10.92 (3.8%) 10.98 (3.6%) 0.6% ( -6% - 8%) 0.625
AndHighLow 1026.60 (4.0%) 1032.70 (4.5%) 0.6% ( -7% - 9%) 0.659
BrowseMonthSSDVFacets 31.41 (5.4%) 31.63 (2.4%) 0.7% ( -6% - 9%) 0.595
IntNRQ 293.66 (3.3%) 295.77 (2.0%) 0.7% ( -4% - 6%) 0.407
HighSpanNear 147.03 (3.1%) 148.42 (1.8%) 0.9% ( -3% - 6%) 0.243
BrowseDayOfYearSSDVFacets 27.66 (5.2%) 28.13 (1.7%) 1.7% ( -4% - 9%) 0.163
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836296845
> I cherry-picked your commit and pushed to this branch / PR to further explore the changes and their effect, hope that's ok.
Of course!
> I also tried to run wikibigall as well, which seems to require enwiki-20100302-pages-articles-lines.txt but it's not downloaded by the util. It appears the archive should be coming from http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2, but it's giving 404 now.
Hmm good question, I had downloaded this file years ago, I'm not sure where to find it nowadays. @mikemccand Do you know where to find it? Otherwise I'll upload mine somewhere.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r627975325
##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+ private final ScoreMode scoreMode;
+ private final int scalingFactor;
+
+ // current doc ID of the leads
+ private int doc;
+
+ // doc id boundary that all scorers maxScore are valid
+ private int upTo = -1;
+
+ // heap of scorers ordered by doc ID
+ private final DisiPriorityQueue essentialsScorers;
+
+ // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+ private final List<DisiWrapper> nonEssentialScorers;
+
+ // sum of max scores of scorers in nonEssentialScorers list
+ private long nonEssentialMaxScoreSum;
+
+ // sum of score of scorers in essentialScorers list that are positioned on matching doc
+ private long matchedDocScoreSum;
+
+ private long cost;
+
+ private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+ private final List<Scorer> scorers;
+
+ // scaled min competitive score
+ private long minCompetitiveScore = 0;
+
+ /**
+ * Constructs a Scorer
+ *
+ * @param weight The weight to be used.
+ * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+ * @param scoreMode The scoreMode
+ */
+ public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+ throws IOException {
+ super(weight);
+ assert scoreMode == ScoreMode.TOP_SCORES;
+
+ this.scoreMode = scoreMode;
+ this.doc = -1;
+ this.scorers = scorers;
+ this.cost =
+ costWithMinShouldMatch(
+ scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+ scorers.size(),
+ 1);
+
+ essentialsScorers = new DisiPriorityQueue(scorers.size());
+ nonEssentialScorers = new LinkedList<>();
+
+ scalingFactor = WANDScorer.getScalingFactor(scorers);
+ maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+ for (Scorer scorer : scorers) {
+ nonEssentialScorers.add(new DisiWrapper(scorer));
+ }
+ }
+
+ @Override
+ public DocIdSetIterator iterator() {
+ return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+ }
+
+ @Override
+ public TwoPhaseIterator twoPhaseIterator() {
+ DocIdSetIterator approximation =
+ new DocIdSetIterator() {
+ private long lastMinCompetitiveScore;
+
+ @Override
+ public int docID() {
+ return doc;
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return advance(doc + 1);
+ }
+
+ @Override
+ public int advance(int target) throws IOException {
+ doAdvance(target);
+
+ while (doc != DocIdSetIterator.NO_MORE_DOCS
+ && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+ doAdvance(doc + 1);
+ }
+
+ return doc;
+ }
+
+ private void doAdvance(int target) throws IOException {
+ matchedDocScoreSum = 0;
+ // Find next smallest doc id that is larger than or equal to target from the essential
+ // scorers
+
+ // If the next candidate doc id is still within interval boundary,
+ if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+ while (essentialsScorers.top().doc < target) {
+ DisiWrapper w = essentialsScorers.pop();
+ w.doc = w.iterator.advance(target);
+ essentialsScorers.add(w);
+ }
+
+ if (essentialsScorers.top().doc <= upTo) {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ } else {
+ doc = upTo + 1;
+ }
+ } else {
+ lastMinCompetitiveScore = minCompetitiveScore;
+ // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+ // Find next interval boundary.
+ // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+ // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+ // Find the block interval boundary that is the minimum of all participating scorer's
+ // block boundary. Then run BMM within each interval.
+ updateUpToAndMaxScore(target);
+
+ repartitionLists();
+
+ // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+ // available up to upTo
+ if (essentialsScorers.size() == 0) {
+ // current bound no long valid
+ doc = upTo + 1;
+ } else {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ }
+ }
+
+ for (DisiWrapper w : essentialsScorers) {
+ if (w.doc == doc && doc != NO_MORE_DOCS) {
+ matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+ }
+ }
+ }
+
+ private void updateUpToAndMaxScore(int target) throws IOException {
+ while (essentialsScorers.size() > 0) {
+ nonEssentialScorers.add(essentialsScorers.pop());
+ }
+
+ // reset upTo
+ upTo = DocIdSetIterator.NO_MORE_DOCS;
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc < target) {
+ upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+ } else {
+ upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+ }
+ }
+ assert target <= upTo;
+
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc <= upTo) {
+ w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+ } else {
+ // This scorer won't be able to contribute to match for target, setting its maxScore
+ // to 0 so it goes into nonEssentialList
+ w.maxScore = 0;
+ }
+ if (w.doc < target) {
+ w.doc = w.iterator.advance(target);
+ }
Review comment:
Hmm I think the purpose here is similar to https://github.com/jpountz/lucene/blob/ce2e5baed8831aac0f44398b97c54264293beb40/lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java#L117-L120, to make sure scorer gets positioned on or after `target` if `upTo` is updated. But your approach would require less positioning than mine since it only updates the ones in essential list.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836879100
> FWIW the file I have locally is enwiki-20130102-lines.txt, not the enwiki-20100302-pages-articles-lines.txt file that luceneutil refers to.
Aha! I have that one locally, compressed with `.lzma` -- I'll upload it.
OK it's copying now to https://home.apache.org/~mikemccand/enwiki-20130102-lines.txt.lzma! ETA ~15 minutes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-834791219
> The last two are optimization techniques not mentioned in the paper I think?
To be honest I didn't read the paper recently so it's possible I diverged a bit from it.
(2) feels natural to avoid doing useless score computations, though it might only work well when score upper bounds are very close to the actual scores. Maybe we should test on wikibig instead of wikimedium to get better confidence that this change makes things better.
Regarding (3) does it actually push more scorers into `nonEssentialScorers`? I thought I just reorganized the existing logic a bit. If it pushes more scorers into `nonEssentialScorers` it's probably a bug. :)
> but it seems like a good net speedup given the latter twos already have much faster QPS compared to OrHighHigh?
Agreed, we already made similar choices in the past. It's probably still worth playing with a wider variety of queries, e.g. queries with many terms that have mixed frequencies, e.g. something like OrHighHighMedMedLowLow, or disjunctions within conjunctions/conjunctions within disjunctions.
Maybe we can also try to port similar changes to the bulk scorer to see if it yields even greater benefits?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-837864536
Thanks @mikemccand and @jpountz for the uploads!
> The nightly benchmarks uses the binary form of wikibigall, to reduce thread bottleneck when reading/parsing documents to index. Hmm it is sampled from a different date (01/15/2011) ... OK I am uploading that one to https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes more).
I'm assuming we should now run `nightlyBench.py` instead of `wikibigall`, since the latter depends on `enwiki-20100302-pages-articles-lines.txt` but it's no longer available?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r627978058
##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+ private final ScoreMode scoreMode;
+ private final int scalingFactor;
+
+ // current doc ID of the leads
+ private int doc;
+
+ // doc id boundary that all scorers maxScore are valid
+ private int upTo = -1;
+
+ // heap of scorers ordered by doc ID
+ private final DisiPriorityQueue essentialsScorers;
+
+ // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+ private final List<DisiWrapper> nonEssentialScorers;
+
+ // sum of max scores of scorers in nonEssentialScorers list
+ private long nonEssentialMaxScoreSum;
+
+ // sum of score of scorers in essentialScorers list that are positioned on matching doc
+ private long matchedDocScoreSum;
+
+ private long cost;
+
+ private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+ private final List<Scorer> scorers;
+
+ // scaled min competitive score
+ private long minCompetitiveScore = 0;
+
+ /**
+ * Constructs a Scorer
+ *
+ * @param weight The weight to be used.
+ * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+ * @param scoreMode The scoreMode
+ */
+ public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+ throws IOException {
+ super(weight);
+ assert scoreMode == ScoreMode.TOP_SCORES;
+
+ this.scoreMode = scoreMode;
+ this.doc = -1;
+ this.scorers = scorers;
+ this.cost =
+ costWithMinShouldMatch(
+ scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+ scorers.size(),
+ 1);
+
+ essentialsScorers = new DisiPriorityQueue(scorers.size());
+ nonEssentialScorers = new LinkedList<>();
+
+ scalingFactor = WANDScorer.getScalingFactor(scorers);
+ maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+ for (Scorer scorer : scorers) {
+ nonEssentialScorers.add(new DisiWrapper(scorer));
+ }
+ }
+
+ @Override
+ public DocIdSetIterator iterator() {
+ return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+ }
+
+ @Override
+ public TwoPhaseIterator twoPhaseIterator() {
+ DocIdSetIterator approximation =
+ new DocIdSetIterator() {
+ private long lastMinCompetitiveScore;
+
+ @Override
+ public int docID() {
+ return doc;
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return advance(doc + 1);
+ }
+
+ @Override
+ public int advance(int target) throws IOException {
+ doAdvance(target);
+
+ while (doc != DocIdSetIterator.NO_MORE_DOCS
+ && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+ doAdvance(doc + 1);
+ }
+
+ return doc;
+ }
+
+ private void doAdvance(int target) throws IOException {
+ matchedDocScoreSum = 0;
+ // Find next smallest doc id that is larger than or equal to target from the essential
+ // scorers
+
+ // If the next candidate doc id is still within interval boundary,
+ if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+ while (essentialsScorers.top().doc < target) {
+ DisiWrapper w = essentialsScorers.pop();
+ w.doc = w.iterator.advance(target);
+ essentialsScorers.add(w);
+ }
+
+ if (essentialsScorers.top().doc <= upTo) {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ } else {
+ doc = upTo + 1;
+ }
+ } else {
+ lastMinCompetitiveScore = minCompetitiveScore;
+ // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+ // Find next interval boundary.
+ // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+ // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+ // Find the block interval boundary that is the minimum of all participating scorer's
+ // block boundary. Then run BMM within each interval.
+ updateUpToAndMaxScore(target);
+
+ repartitionLists();
+
+ // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+ // available up to upTo
+ if (essentialsScorers.size() == 0) {
+ // current bound no long valid
+ doc = upTo + 1;
+ } else {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ }
+ }
+
+ for (DisiWrapper w : essentialsScorers) {
+ if (w.doc == doc && doc != NO_MORE_DOCS) {
+ matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+ }
+ }
+ }
+
+ private void updateUpToAndMaxScore(int target) throws IOException {
+ while (essentialsScorers.size() > 0) {
+ nonEssentialScorers.add(essentialsScorers.pop());
+ }
+
+ // reset upTo
+ upTo = DocIdSetIterator.NO_MORE_DOCS;
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc < target) {
+ upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+ } else {
+ upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+ }
+ }
+ assert target <= upTo;
+
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc <= upTo) {
+ w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+ } else {
+ // This scorer won't be able to contribute to match for target, setting its maxScore
+ // to 0 so it goes into nonEssentialList
+ w.maxScore = 0;
+ }
+ if (w.doc < target) {
+ w.doc = w.iterator.advance(target);
+ }
+ }
+ }
+
+ private void repartitionLists() {
+ Collections.sort(nonEssentialScorers, (w1, w2) -> (int) (w1.maxScore - w2.maxScore));
+
+ // Re-partition the scorers into non-essential list and essential list, as defined in
+ // the "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" paper.
+ nonEssentialMaxScoreSum = 0;
+ for (int i = 0; i < nonEssentialScorers.size(); i++) {
+ DisiWrapper w = nonEssentialScorers.get(i);
+ if (nonEssentialMaxScoreSum + w.maxScore < minCompetitiveScore) {
+ nonEssentialMaxScoreSum += w.maxScore;
+ } else {
+ // the logic is a bit ugly here...but as soon as we find maxScore of scorers in
+ // non-essential list sum to above minCompetitiveScore, we move the rest of
+ // scorers
+ // into essential list
+ for (int j = nonEssentialScorers.size() - 1; j >= i; j--) {
+ essentialsScorers.add(nonEssentialScorers.remove(j));
+ }
+ break;
+ }
+ }
+ }
+
+ @Override
+ public long cost() {
+ // fixed at initialization
+ return cost;
+ }
+ };
+
+ return new TwoPhaseIterator(approximation) {
+ @Override
+ public boolean matches() throws IOException {
+ // The doc is a match when all scores sum above minCompetitiveScore
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc < doc) {
+ w.doc = w.iterator.advance(doc);
+ }
+ }
+
+ if (matchedDocScoreSum >= minCompetitiveScore) {
+ return true;
+ }
+
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc == doc) {
+ matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+
+ if (matchedDocScoreSum >= minCompetitiveScore) {
+ return true;
+ }
+ }
+ }
+
+ return false;
+ }
Review comment:
Yeah I think I was following the approach mentioned in the paper too strictly here (candidate doc was picked based on sum of maxScore from non-essential list scorers, and actual score from essential list scorers, on the positioned doc). Implementing it this way definitely will save a lot of computations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-839419774
Thanks Adrien for the suggestion! Let me give it a try.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836811596
@mikemccand I'll try to do it overnight as I have a terrible uplink. FWIW the file I have locally is `enwiki-20130102-lines.txt`, not the `enwiki-20100302-pages-articles-lines.txt` file that luceneutil refers to.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #101:
URL: https://github.com/apache/lucene/pull/101#discussion_r627389726
##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+ private final ScoreMode scoreMode;
+ private final int scalingFactor;
+
+ // current doc ID of the leads
+ private int doc;
+
+ // doc id boundary that all scorers maxScore are valid
+ private int upTo = -1;
+
+ // heap of scorers ordered by doc ID
+ private final DisiPriorityQueue essentialsScorers;
+
+ // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+ private final List<DisiWrapper> nonEssentialScorers;
+
+ // sum of max scores of scorers in nonEssentialScorers list
+ private long nonEssentialMaxScoreSum;
+
+ // sum of score of scorers in essentialScorers list that are positioned on matching doc
+ private long matchedDocScoreSum;
+
+ private long cost;
+
+ private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+ private final List<Scorer> scorers;
+
+ // scaled min competitive score
+ private long minCompetitiveScore = 0;
+
+ /**
+ * Constructs a Scorer
+ *
+ * @param weight The weight to be used.
+ * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+ * @param scoreMode The scoreMode
+ */
+ public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+ throws IOException {
+ super(weight);
+ assert scoreMode == ScoreMode.TOP_SCORES;
+
+ this.scoreMode = scoreMode;
+ this.doc = -1;
+ this.scorers = scorers;
+ this.cost =
+ costWithMinShouldMatch(
+ scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+ scorers.size(),
+ 1);
+
+ essentialsScorers = new DisiPriorityQueue(scorers.size());
+ nonEssentialScorers = new LinkedList<>();
+
+ scalingFactor = WANDScorer.getScalingFactor(scorers);
+ maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+ for (Scorer scorer : scorers) {
+ nonEssentialScorers.add(new DisiWrapper(scorer));
+ }
+ }
+
+ @Override
+ public DocIdSetIterator iterator() {
+ return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+ }
+
+ @Override
+ public TwoPhaseIterator twoPhaseIterator() {
+ DocIdSetIterator approximation =
+ new DocIdSetIterator() {
+ private long lastMinCompetitiveScore;
+
+ @Override
+ public int docID() {
+ return doc;
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return advance(doc + 1);
+ }
+
+ @Override
+ public int advance(int target) throws IOException {
+ doAdvance(target);
+
+ while (doc != DocIdSetIterator.NO_MORE_DOCS
+ && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+ doAdvance(doc + 1);
+ }
+
+ return doc;
+ }
+
+ private void doAdvance(int target) throws IOException {
+ matchedDocScoreSum = 0;
+ // Find next smallest doc id that is larger than or equal to target from the essential
+ // scorers
+
+ // If the next candidate doc id is still within interval boundary,
+ if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+ while (essentialsScorers.top().doc < target) {
+ DisiWrapper w = essentialsScorers.pop();
+ w.doc = w.iterator.advance(target);
+ essentialsScorers.add(w);
+ }
+
+ if (essentialsScorers.top().doc <= upTo) {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ } else {
+ doc = upTo + 1;
+ }
+ } else {
+ lastMinCompetitiveScore = minCompetitiveScore;
+ // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+ // Find next interval boundary.
+ // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+ // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+ // Find the block interval boundary that is the minimum of all participating scorer's
+ // block boundary. Then run BMM within each interval.
+ updateUpToAndMaxScore(target);
+
+ repartitionLists();
+
+ // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+ // available up to upTo
+ if (essentialsScorers.size() == 0) {
+ // current bound no long valid
+ doc = upTo + 1;
+ } else {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ }
+ }
+
+ for (DisiWrapper w : essentialsScorers) {
+ if (w.doc == doc && doc != NO_MORE_DOCS) {
+ matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+ }
+ }
+ }
+
+ private void updateUpToAndMaxScore(int target) throws IOException {
+ while (essentialsScorers.size() > 0) {
+ nonEssentialScorers.add(essentialsScorers.pop());
+ }
+
+ // reset upTo
+ upTo = DocIdSetIterator.NO_MORE_DOCS;
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc < target) {
+ upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+ } else {
+ upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+ }
+ }
+ assert target <= upTo;
+
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc <= upTo) {
+ w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+ } else {
+ // This scorer won't be able to contribute to match for target, setting its maxScore
+ // to 0 so it goes into nonEssentialList
+ w.maxScore = 0;
+ }
+ if (w.doc < target) {
+ w.doc = w.iterator.advance(target);
+ }
+ }
+ }
+
+ private void repartitionLists() {
+ Collections.sort(nonEssentialScorers, (w1, w2) -> (int) (w1.maxScore - w2.maxScore));
+
+ // Re-partition the scorers into non-essential list and essential list, as defined in
+ // the "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" paper.
+ nonEssentialMaxScoreSum = 0;
+ for (int i = 0; i < nonEssentialScorers.size(); i++) {
+ DisiWrapper w = nonEssentialScorers.get(i);
+ if (nonEssentialMaxScoreSum + w.maxScore < minCompetitiveScore) {
+ nonEssentialMaxScoreSum += w.maxScore;
+ } else {
+ // the logic is a bit ugly here...but as soon as we find maxScore of scorers in
+ // non-essential list sum to above minCompetitiveScore, we move the rest of
+ // scorers
+ // into essential list
+ for (int j = nonEssentialScorers.size() - 1; j >= i; j--) {
+ essentialsScorers.add(nonEssentialScorers.remove(j));
+ }
+ break;
+ }
+ }
+ }
+
+ @Override
+ public long cost() {
+ // fixed at initialization
+ return cost;
+ }
+ };
+
+ return new TwoPhaseIterator(approximation) {
+ @Override
+ public boolean matches() throws IOException {
+ // The doc is a match when all scores sum above minCompetitiveScore
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc < doc) {
+ w.doc = w.iterator.advance(doc);
+ }
+ }
+
+ if (matchedDocScoreSum >= minCompetitiveScore) {
+ return true;
+ }
+
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc == doc) {
+ matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+
+ if (matchedDocScoreSum >= minCompetitiveScore) {
+ return true;
+ }
+ }
+ }
+
+ return false;
+ }
Review comment:
I don't think that implementing a TwoPhaseIterator this way helps, since we're computing scores once here to check whether the hit is competitive, and then again in `score()` if the hit is competitive. Let's return the approximation directly as `iterator()` and hits that don't have a competitive scores will naturally be ignored after `score()` is computed since their scores will be lower than the score of the top of the priority queue?
##########
File path: lucene/core/src/java/org/apache/lucene/search/BlockMaxMaxscoreScorer.java
##########
@@ -0,0 +1,339 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.ScorerUtil.costWithMinShouldMatch;
+
+import java.io.IOException;
+import java.util.*;
+
+/** Scorer implementing Block-Max Maxscore algorithm */
+public class BlockMaxMaxscoreScorer extends Scorer {
+ private final ScoreMode scoreMode;
+ private final int scalingFactor;
+
+ // current doc ID of the leads
+ private int doc;
+
+ // doc id boundary that all scorers maxScore are valid
+ private int upTo = -1;
+
+ // heap of scorers ordered by doc ID
+ private final DisiPriorityQueue essentialsScorers;
+
+ // list of scorers whose sum of maxScore is less than minCompetitiveScore, ordered by maxScore
+ private final List<DisiWrapper> nonEssentialScorers;
+
+ // sum of max scores of scorers in nonEssentialScorers list
+ private long nonEssentialMaxScoreSum;
+
+ // sum of score of scorers in essentialScorers list that are positioned on matching doc
+ private long matchedDocScoreSum;
+
+ private long cost;
+
+ private final MaxScoreSumPropagator maxScoreSumPropagator;
+
+ private final List<Scorer> scorers;
+
+ // scaled min competitive score
+ private long minCompetitiveScore = 0;
+
+ /**
+ * Constructs a Scorer
+ *
+ * @param weight The weight to be used.
+ * @param scorers The sub scorers this Scorer should iterate on for optional clauses
+ * @param scoreMode The scoreMode
+ */
+ public BlockMaxMaxscoreScorer(Weight weight, List<Scorer> scorers, ScoreMode scoreMode)
+ throws IOException {
+ super(weight);
+ assert scoreMode == ScoreMode.TOP_SCORES;
+
+ this.scoreMode = scoreMode;
+ this.doc = -1;
+ this.scorers = scorers;
+ this.cost =
+ costWithMinShouldMatch(
+ scorers.stream().map(Scorer::iterator).mapToLong(DocIdSetIterator::cost),
+ scorers.size(),
+ 1);
+
+ essentialsScorers = new DisiPriorityQueue(scorers.size());
+ nonEssentialScorers = new LinkedList<>();
+
+ scalingFactor = WANDScorer.getScalingFactor(scorers);
+ maxScoreSumPropagator = new MaxScoreSumPropagator(scorers);
+
+ for (Scorer scorer : scorers) {
+ nonEssentialScorers.add(new DisiWrapper(scorer));
+ }
+ }
+
+ @Override
+ public DocIdSetIterator iterator() {
+ return TwoPhaseIterator.asDocIdSetIterator(twoPhaseIterator());
+ }
+
+ @Override
+ public TwoPhaseIterator twoPhaseIterator() {
+ DocIdSetIterator approximation =
+ new DocIdSetIterator() {
+ private long lastMinCompetitiveScore;
+
+ @Override
+ public int docID() {
+ return doc;
+ }
+
+ @Override
+ public int nextDoc() throws IOException {
+ return advance(doc + 1);
+ }
+
+ @Override
+ public int advance(int target) throws IOException {
+ doAdvance(target);
+
+ while (doc != DocIdSetIterator.NO_MORE_DOCS
+ && nonEssentialMaxScoreSum + matchedDocScoreSum < minCompetitiveScore) {
+ doAdvance(doc + 1);
+ }
+
+ return doc;
+ }
+
+ private void doAdvance(int target) throws IOException {
+ matchedDocScoreSum = 0;
+ // Find next smallest doc id that is larger than or equal to target from the essential
+ // scorers
+
+ // If the next candidate doc id is still within interval boundary,
+ if (lastMinCompetitiveScore == minCompetitiveScore && target <= upTo) {
+ while (essentialsScorers.top().doc < target) {
+ DisiWrapper w = essentialsScorers.pop();
+ w.doc = w.iterator.advance(target);
+ essentialsScorers.add(w);
+ }
+
+ if (essentialsScorers.top().doc <= upTo) {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ } else {
+ doc = upTo + 1;
+ }
+ } else {
+ lastMinCompetitiveScore = minCompetitiveScore;
+ // Next candidate doc id is above interval boundary, or minCompetitive has increased.
+ // Find next interval boundary.
+ // Block boundary alignment strategy is adapted from "Optimizing Top-k Document
+ // Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel.
+ // Find the block interval boundary that is the minimum of all participating scorer's
+ // block boundary. Then run BMM within each interval.
+ updateUpToAndMaxScore(target);
+
+ repartitionLists();
+
+ // maxScore of all scorers sum to less than minCompetitiveScore, no more result is
+ // available up to upTo
+ if (essentialsScorers.size() == 0) {
+ // current bound no long valid
+ doc = upTo + 1;
+ } else {
+ doc = essentialsScorers.top().doc;
+
+ if (doc == NO_MORE_DOCS) {
+ return;
+ }
+ }
+ }
+
+ for (DisiWrapper w : essentialsScorers) {
+ if (w.doc == doc && doc != NO_MORE_DOCS) {
+ matchedDocScoreSum += WANDScorer.scaleMaxScore(w.scorer.score(), scalingFactor);
+ }
+ }
+ }
+
+ private void updateUpToAndMaxScore(int target) throws IOException {
+ while (essentialsScorers.size() > 0) {
+ nonEssentialScorers.add(essentialsScorers.pop());
+ }
+
+ // reset upTo
+ upTo = DocIdSetIterator.NO_MORE_DOCS;
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc < target) {
+ upTo = Math.min(w.scorer.advanceShallow(target), upTo);
+ } else {
+ upTo = Math.min(w.scorer.advanceShallow(w.doc), upTo);
+ }
+ }
+ assert target <= upTo;
+
+ for (DisiWrapper w : nonEssentialScorers) {
+ if (w.doc <= upTo) {
+ w.maxScore = WANDScorer.scaleMaxScore(w.scorer.getMaxScore(upTo), scalingFactor);
+ } else {
+ // This scorer won't be able to contribute to match for target, setting its maxScore
+ // to 0 so it goes into nonEssentialList
+ w.maxScore = 0;
+ }
+ if (w.doc < target) {
+ w.doc = w.iterator.advance(target);
+ }
Review comment:
I wouldn't expect that we would have to advance scorers when updating `upTo`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-840934854
Hi @jpountz, upon further reviewing the result above, I found one issue for this particular scorer implementation (but not for the other BulkScorer based implementations). Currently this BMM scorer would also be used for nested boolean queries such as `+body:interview +(body:at body:united)` (AndMedOrHighHigh), but in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed? I took a quick look at the code and it seems the query and weight interfaces do not provide an API to tell whether they are top-level or not currently, so this may involve some changes in `BooleanWeight` code I think.
On the other hand, from the results above it looks like 2 out of 3 instances, having BMM scorer to run for nested queries doesn’t hurt its performance (the `AndMedOrHighHigh ` one). So I’m wondering maybe it’s ok to leave it as is, as BMM may still give a boost for the pure disjunction part of the query? The one result that does show negative impact to `AndMedOrHighHigh` also shows impact to `OrHighMed`, so it’s a bit strange and may need further looking into to see the cause.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #101: LUCENE-9335: [Discussion Only] Add BMM scorer and use it for pure disjunction term query
Posted by GitBox <gi...@apache.org>.
zacharymorn commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-841606431
> > in the jira ticket you had suggested to use BMM for top-level (flat?) boolean query only. Do you think this will need to be fixed?
>
> I opened this JIRA ticket because it felt like we could do better for top-level disjunctions, but if BMM appears to work better most of the time, we could just move to it all the time.
>
> > The one result that does show negative impact to AndMedOrHighHigh also shows impact to OrHighMed, so it’s a bit strange and may need further looking into to see the cause.
>
> Yeah, I suspect there will always be cases when BMW will perform better than BMM or vice-versa, sometimes for subtle reasons.
Makes sense! I'll not attempt to fix it for now then.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org