Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2021/09/23 18:23:00 UTC
[jira] [Commented] (LUCENE-10121) WANDScorer could skip more
[ https://issues.apache.org/jira/browse/LUCENE-10121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419385#comment-17419385 ]
Adrien Grand commented on LUCENE-10121:
---------------------------------------
I opened a pull request that tries to avoid this issue by looking at the floating-point scores as well, still in a way that is not prone to rounding errors.
Here are the results of luceneutil on wikibigall:
{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff                  p-value
HighTermDayOfYearSort       2048.74  (29.0%)    1984.55  (30.6%)     -3.1%  ( -48% -  79%)    0.740
Fuzzy2                       100.30   (4.8%)      98.92   (5.2%)     -1.4%  ( -10% -   9%)    0.383
HighPhrase                    84.34   (2.6%)      83.47   (1.8%)     -1.0%  (  -5% -   3%)    0.141
OrHighLow                    576.29   (2.8%)     570.57   (2.4%)     -1.0%  (  -6% -   4%)    0.227
AndHighLow                   484.18   (3.8%)     480.17   (3.5%)     -0.8%  (  -7% -   6%)    0.473
OrHighMed                     95.16   (4.0%)      94.41   (3.8%)     -0.8%  (  -8% -   7%)    0.520
Respell                      177.03   (2.4%)     176.16   (2.4%)     -0.5%  (  -5% -   4%)    0.517
HighSloppyPhrase               3.50   (3.5%)       3.49   (4.2%)     -0.5%  (  -7% -   7%)    0.692
AndHighMed                   154.00   (4.0%)     153.27   (3.8%)     -0.5%  (  -8% -   7%)    0.704
Prefix3                      210.72  (12.9%)     209.87  (13.2%)     -0.4%  ( -23% -  29%)    0.922
HighTerm                    1546.28   (3.6%)    1540.74   (2.8%)     -0.4%  (  -6% -   6%)    0.727
HighTermMonthSort            116.31   (6.0%)     115.94   (4.9%)     -0.3%  ( -10% -  11%)    0.853
IntNRQ                       435.27   (1.9%)     434.13   (1.5%)     -0.3%  (  -3% -   3%)    0.622
Wildcard                     126.26  (12.5%)     125.93  (13.2%)     -0.3%  ( -23% -  28%)    0.950
Fuzzy1                       181.58   (8.5%)     181.12   (6.9%)     -0.3%  ( -14% -  16%)    0.917
LowPhrase                     60.14   (2.2%)      60.02   (2.0%)     -0.2%  (  -4% -   4%)    0.750
MedTerm                     1549.63   (2.5%)    1547.41   (3.2%)     -0.1%  (  -5% -   5%)    0.874
LowSpanNear                   13.72   (3.3%)      13.71   (2.8%)     -0.1%  (  -6% -   6%)    0.944
AndHighHigh                   73.67   (3.5%)      73.62   (2.9%)     -0.1%  (  -6% -   6%)    0.950
LowTerm                     2856.45   (3.4%)    2855.10   (4.3%)     -0.0%  (  -7% -   7%)    0.969
MedSpanNear                    5.15   (9.9%)       5.15   (9.1%)     -0.0%  ( -17% -  21%)    0.996
MedPhrase                     25.88   (2.4%)      25.87   (2.4%)     -0.0%  (  -4% -   5%)    0.987
LowSloppyPhrase               79.38   (3.8%)      79.48   (3.6%)      0.1%  (  -7% -   7%)    0.917
MedSloppyPhrase               12.25   (3.1%)      12.28   (3.4%)      0.2%  (  -6% -   6%)    0.817
HighSpanNear                   6.20   (4.5%)       6.22   (3.3%)      0.3%  (  -7% -   8%)    0.782
OrHighHigh                    20.94   (3.3%)      21.35   (4.2%)      2.0%  (  -5% -   9%)    0.098
{noformat}
There is a modest improvement to OrHighHigh (it is consistently reproducible, so I believe it is not noise) with no slowdown of other queries. A small gain is expected since most blocks generally have different maximum scores.
However, the (cab_color:y OR cab_color:g) query on the sorted sparse NYC Taxis benchmark goes from 80ms to 2ms.
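To illustrate the kind of rounding effect at play, here is a minimal sketch. It is not the actual WANDScorer code: the class name, the method names, the scaling factor of 16 and the rounding directions (maximum scores rounded up, the minimum competitive score rounded down) are all assumptions made for the example. It shows a block whose maximum score is just below the minimum competitive score: the check on scaled integer scores still treats the block as a possible match, while a check on the float scores rules it out.

```java
public class ScaledScoreDemo {

    // Hypothetical scaling factor: scores are multiplied by 2^16 before rounding.
    static final int SCALING_FACTOR = 16;

    // Assumption: maximum scores are rounded up so that they stay upper bounds.
    static long scaleMaxScore(float maxScore) {
        return (long) Math.ceil(Math.scalb(maxScore, SCALING_FACTOR));
    }

    // Assumption: the minimum competitive score is rounded down so that
    // no competitive document is ever missed.
    static long scaleMinScore(float minScore) {
        return (long) Math.floor(Math.scalb(minScore, SCALING_FACTOR));
    }

    // Check based only on the rounded, scaled scores.
    static boolean competitiveByScaledScores(float blockMaxScore, float minCompetitiveScore) {
        return scaleMaxScore(blockMaxScore) >= scaleMinScore(minCompetitiveScore);
    }

    // Check based on the actual float scores.
    static boolean competitiveByFloatScores(float blockMaxScore, float minCompetitiveScore) {
        return blockMaxScore >= minCompetitiveScore;
    }

    public static void main(String[] args) {
        // Every block contains a document with the same maximum score, and the
        // collector only accepts documents that score strictly higher.
        float blockMaxScore = 0.7f;
        float minCompetitiveScore = Math.nextUp(blockMaxScore);

        // true: the rounding makes the block look like a possible match
        System.out.println(competitiveByScaledScores(blockMaxScore, minCompetitiveScore));
        // false: the float scores show that the block can be skipped
        System.out.println(competitiveByFloatScores(blockMaxScore, minCompetitiveScore));
    }
}
```

In this sketch, running the float check in addition to the scaled check is what allows the block to be skipped. With a single score no summation is involved, so the float comparison is exact here; a real scorer that sums the scores of several clauses would need to take care of rounding in that sum as well.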
> WANDScorer could skip more
> --------------------------
>
> Key: LUCENE-10121
> URL: https://issues.apache.org/jira/browse/LUCENE-10121
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I was looking at the NYC Taxis benchmark recently and got puzzled by the fact that the query (cab_color:y OR cab_color:g) ran so slowly: http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#search_bq_qps. This is supposed to be a best-case scenario for WAND: there are only two possible scores for documents, so this query should return instantly in the sorted case.
> After digging I noticed that this is due to the scaling that we do in WANDScorer to avoid floating-point rounding errors: documents can be considered as possible matches according to the scaled scores (which are rounded) while they cannot possibly match according to the actual scores. This is especially visible when many blocks contain a document that has the maximum score across the entire postings list, for instance with any field indexed with indexOptions=DOCS or with constant-scoring queries.