You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/10 19:45:00 UTC
[GitHub] [lucene] gsmiller opened a new pull request, #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
gsmiller opened a new pull request, #12073:
URL: https://github.com/apache/lucene/pull/12073
### Description
When the exclusion scorer in `ReqExclScorer` has no second-phase check, we can move the exclusion checking into the first phase fairly easily. This intuitively seems like the right thing to do, and `luceneutil#wikimedium10m` appears to confirm the performance is a little better for relevant negation tasks (`OrNotHighHigh`, `OrHighNotMed`, `OrHighNotLow`, `OrNotHighMed`, and `OrHighNotHigh`).
Full benchmark results for `wikimedium10m`:
```
TaskQPS baseline StdDevQPS candidate StdDev Pct diff p-value
AndHighHigh 50.29 (4.8%) 49.54 (5.1%) -1.5% ( -10% - 8%) 0.333
MedTerm 521.14 (4.5%) 513.29 (4.0%) -1.5% ( -9% - 7%) 0.263
HighTerm 539.31 (4.3%) 532.46 (3.9%) -1.3% ( -9% - 7%) 0.324
BrowseMonthTaxoFacets 32.65 (3.8%) 32.24 (7.1%) -1.3% ( -11% - 9%) 0.484
MedSloppyPhrase 113.51 (2.5%) 112.23 (3.0%) -1.1% ( -6% - 4%) 0.200
HighSloppyPhrase 23.06 (3.2%) 22.81 (3.6%) -1.1% ( -7% - 5%) 0.318
HighTermMonthSort 2860.46 (2.8%) 2831.28 (3.4%) -1.0% ( -7% - 5%) 0.301
LowSloppyPhrase 29.08 (1.9%) 28.80 (2.1%) -1.0% ( -4% - 3%) 0.127
AndHighMed 245.19 (2.7%) 242.90 (3.4%) -0.9% ( -6% - 5%) 0.334
LowTerm 778.95 (4.6%) 772.26 (3.2%) -0.9% ( -8% - 7%) 0.491
BrowseMonthSSDVFacets 14.09 (1.8%) 13.98 (1.8%) -0.8% ( -4% - 2%) 0.187
IntNRQ 215.47 (6.6%) 214.33 (7.4%) -0.5% ( -13% - 14%) 0.811
OrHighMed 98.72 (3.0%) 98.31 (2.5%) -0.4% ( -5% - 5%) 0.629
LowPhrase 47.03 (2.9%) 46.86 (3.2%) -0.4% ( -6% - 5%) 0.711
HighPhrase 62.54 (4.2%) 62.34 (4.8%) -0.3% ( -8% - 9%) 0.827
Prefix3 214.70 (3.4%) 214.10 (2.9%) -0.3% ( -6% - 6%) 0.780
MedPhrase 36.02 (3.1%) 35.93 (3.6%) -0.2% ( -6% - 6%) 0.829
Wildcard 116.23 (3.4%) 115.99 (2.6%) -0.2% ( -6% - 6%) 0.826
BrowseRandomLabelSSDVFacets 10.21 (8.2%) 10.19 (8.3%) -0.2% ( -15% - 17%) 0.943
OrHighHigh 34.57 (3.2%) 34.59 (3.8%) 0.1% ( -6% - 7%) 0.960
AndHighLow 1255.64 (2.2%) 1256.69 (2.1%) 0.1% ( -4% - 4%) 0.904
Respell 46.56 (1.0%) 46.61 (1.1%) 0.1% ( -2% - 2%) 0.795
MedIntervalsOrdered 6.62 (2.7%) 6.62 (2.8%) 0.1% ( -5% - 5%) 0.901
AndHighMedDayTaxoFacets 82.57 (1.7%) 82.67 (1.5%) 0.1% ( -3% - 3%) 0.825
AndHighHighDayTaxoFacets 9.35 (2.0%) 9.36 (2.2%) 0.1% ( -3% - 4%) 0.849
LowIntervalsOrdered 27.77 (5.0%) 27.80 (5.4%) 0.1% ( -9% - 11%) 0.935
BrowseRandomLabelTaxoFacets 23.47 (2.8%) 23.50 (2.8%) 0.1% ( -5% - 5%) 0.874
Fuzzy2 75.07 (1.0%) 75.21 (1.1%) 0.2% ( -1% - 2%) 0.587
Fuzzy1 79.08 (0.7%) 79.22 (0.8%) 0.2% ( -1% - 1%) 0.431
MedSpanNear 9.07 (1.9%) 9.08 (2.2%) 0.2% ( -3% - 4%) 0.754
HighSpanNear 5.25 (1.5%) 5.26 (1.4%) 0.3% ( -2% - 3%) 0.588
HighIntervalsOrdered 11.17 (4.5%) 11.20 (4.3%) 0.3% ( -8% - 9%) 0.821
BrowseDateTaxoFacets 31.80 (4.2%) 31.93 (3.7%) 0.4% ( -7% - 8%) 0.727
BrowseDayOfYearTaxoFacets 32.02 (4.4%) 32.19 (4.0%) 0.5% ( -7% - 9%) 0.689
MedTermDayTaxoFacets 42.99 (3.8%) 43.24 (3.3%) 0.6% ( -6% - 7%) 0.605
HighTermTitleSort 112.14 (3.8%) 112.79 (4.3%) 0.6% ( -7% - 9%) 0.652
LowSpanNear 198.11 (1.7%) 199.54 (1.6%) 0.7% ( -2% - 4%) 0.172
OrNotHighLow 722.41 (2.9%) 728.48 (2.8%) 0.8% ( -4% - 6%) 0.351
OrHighLow 370.98 (2.9%) 374.17 (3.2%) 0.9% ( -5% - 7%) 0.372
TermDTSort 73.90 (3.1%) 74.56 (4.1%) 0.9% ( -6% - 8%) 0.432
OrHighMedDayTaxoFacets 4.23 (4.3%) 4.27 (3.5%) 0.9% ( -6% - 9%) 0.460
PKLookup 180.56 (1.5%) 182.54 (2.4%) 1.1% ( -2% - 5%) 0.080
HighTermTitleBDVSort 11.73 (5.1%) 11.87 (6.3%) 1.2% ( -9% - 13%) 0.512
HighTermDayOfYearSort 260.53 (2.9%) 263.83 (2.8%) 1.3% ( -4% - 7%) 0.167
OrNotHighHigh 396.17 (3.8%) 406.17 (3.9%) 2.5% ( -4% - 10%) 0.038
BrowseDateSSDVFacets 3.73 (5.4%) 3.83 (9.2%) 2.6% ( -11% - 18%) 0.280
OrHighNotMed 355.05 (4.5%) 364.37 (5.2%) 2.6% ( -6% - 12%) 0.088
BrowseDayOfYearSSDVFacets 13.03 (2.1%) 13.40 (11.5%) 2.8% ( -10% - 16%) 0.282
OrHighNotLow 373.18 (5.1%) 383.76 (5.7%) 2.8% ( -7% - 14%) 0.098
OrNotHighMed 365.05 (3.8%) 377.09 (3.4%) 3.3% ( -3% - 10%) 0.004
OrHighNotHigh 200.46 (4.2%) 208.91 (4.3%) 4.2% ( -4% - 13%) 0.002
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1379399579
@jpountz:
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378122618
But the benchmark checks only the best case scenario here... TermQuery? What about more expensive excl? e.g. a wildcard or something (with the ConstantScoreScorer that has no two-phase)
The optimization seems risky to me and just not intuitive: personally I think of approximation(ReqExcl) = Req. Why do additional work in the first phase, unless we really need to: it could slow everything down?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378939790
Do these queries actually use `ReqExclScorer`? I would have expected them to use `ReqExclBulkScorer`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378947355
I cant keep up with the conditions where bulk is unsuitable. But just generally, my concern is to implement a best-case 1-2% speedup and create a large worst-case slowdown somewhere else.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1381092287
Moved this into "draft" state until I'm able to come back and figure out why there was a benchmark improvement with this change, given the feedback that `ReqExclBulkScorer` would be expected to kick in. I've gotten busy with some other work, but plan to come back and dig into this. In the meantime though, moving to "draft."
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller closed pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller closed pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
URL: https://github.com/apache/lucene/pull/12073
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378128948
and maybe wildcard query isn't great example either: confirming matches is gonna be fast since its a bitset.
But I am still suspicious of the logic here, I don't think we can infer anything about performance from "twoPhaseIterator() == null"... and it seems that is what we are doing here. We can't infer that matches are fast just because nobody `@Override` the method :) It could just be a query that isn't well optimized (e.g. custom user query or something). There are probably some of these in lucene that are an interesting worst-case here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378951278
Sorry my comment was not targeted at your comment but at understanding why we're seeing a speedup with this change, since I wouldn't expect this scorer to be used (though my understanding could be outdated).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1379404470
@jpountz:
> understanding why we're seeing a speedup with this change
Good question. I was not familiar with `ReqExclBulkScorer`, but after taking a bit of a look, I have the same question now. I'll have to spend more time looking at these benchmark tasks to understand what might be going on here. Thank you for pointing this out.
@rmuir:
Thanks for the feedback! You raise some good questions here. My intuition was that reducing the size of the first phase match set would be beneficial in general. The reasoning being that, 1) it reduces the number of "candidates" that other potential conjunction clauses need to evaluate, and 2) it reduces the number of candidates that need to go to a second-phase (particularly if there are 2nd phase checks occurring _before_ the exclusion check). But I hear your point about term queries being a best-case scenario here. If the exclusion clauses are more complex, it's more difficult to reason about.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org