Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2021/09/23 18:23:00 UTC
[jira] [Commented] (LUCENE-10121) WANDScorer could skip more

    [ https://issues.apache.org/jira/browse/LUCENE-10121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419385#comment-17419385 ] 

Adrien Grand commented on LUCENE-10121:
---------------------------------------

I opened a pull request that tries to avoid this issue by looking at the floating-point scores as well, still in a way that is not prone to rounding errors.
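
To make the idea concrete, here is a minimal sketch of such a two-stage check. It is not the code from the pull request: the class name, the method names and the scaling factor are all hypothetical, and it assumes that document scores are computed as a double sum cast to float.

```java
// Hypothetical sketch, NOT the code from the pull request.
final class TwoStageCompetitiveCheck {

  // Illustrative scaling factor (assumption, chosen only for this example).
  static final double SCALE = 1 << 16;

  // Scale a maximum score, rounding up so the integer bound is never too low.
  static long scaleMax(float maxScore) {
    return (long) Math.ceil(maxScore * SCALE);
  }

  // Scale the minimum competitive score, rounding down so no hit is missed.
  static long scaleMin(float minCompetitiveScore) {
    return (long) Math.floor(minCompetitiveScore * SCALE);
  }

  // Returns true if a block with these per-clause maximum scores may still
  // contain a competitive document.
  static boolean mayBeCompetitive(float[] blockMaxScores, float minCompetitiveScore) {
    long scaledSum = 0;
    double exactSum = 0;
    for (float maxScore : blockMaxScores) {
      scaledSum += scaleMax(maxScore);
      exactSum += maxScore;
    }
    // Stage 1: integer comparison on scaled scores. It is safe but pessimistic
    // because of the rounding directions above.
    if (scaledSum < scaleMin(minCompetitiveScore)) {
      return false;
    }
    // Stage 2: confirm with the floating-point scores. The sum is cast to float
    // under the assumption stated in the lead-in about how scores are computed.
    return (float) exactSum >= minCompetitiveScore;
  }

  public static void main(String[] args) {
    float maxScore = 1.0f;
    float minCompetitive = Math.nextUp(maxScore);
    System.out.println(mayBeCompetitive(new float[] {maxScore}, minCompetitive));
  }
}
```

A real implementation would also have to make sure that the floating-point sum is an upper bound of the score regardless of the order in which clauses are summed.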

Here are the results of luceneutil on wikibigall:

{noformat}
                           Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff p-value
           HighTermDayOfYearSort     2048.74     (29.0%)     1984.55     (30.6%)   -3.1% ( -48% -   79%) 0.740
                          Fuzzy2      100.30      (4.8%)       98.92      (5.2%)   -1.4% ( -10% -    9%) 0.383
                      HighPhrase       84.34      (2.6%)       83.47      (1.8%)   -1.0% (  -5% -    3%) 0.141
                       OrHighLow      576.29      (2.8%)      570.57      (2.4%)   -1.0% (  -6% -    4%) 0.227
                      AndHighLow      484.18      (3.8%)      480.17      (3.5%)   -0.8% (  -7% -    6%) 0.473
                       OrHighMed       95.16      (4.0%)       94.41      (3.8%)   -0.8% (  -8% -    7%) 0.520
                         Respell      177.03      (2.4%)      176.16      (2.4%)   -0.5% (  -5% -    4%) 0.517
                HighSloppyPhrase        3.50      (3.5%)        3.49      (4.2%)   -0.5% (  -7% -    7%) 0.692
                      AndHighMed      154.00      (4.0%)      153.27      (3.8%)   -0.5% (  -8% -    7%) 0.704
                         Prefix3      210.72     (12.9%)      209.87     (13.2%)   -0.4% ( -23% -   29%) 0.922
                        HighTerm     1546.28      (3.6%)     1540.74      (2.8%)   -0.4% (  -6% -    6%) 0.727
               HighTermMonthSort      116.31      (6.0%)      115.94      (4.9%)   -0.3% ( -10% -   11%) 0.853
                          IntNRQ      435.27      (1.9%)      434.13      (1.5%)   -0.3% (  -3% -    3%) 0.622
                        Wildcard      126.26     (12.5%)      125.93     (13.2%)   -0.3% ( -23% -   28%) 0.950
                          Fuzzy1      181.58      (8.5%)      181.12      (6.9%)   -0.3% ( -14% -   16%) 0.917
                       LowPhrase       60.14      (2.2%)       60.02      (2.0%)   -0.2% (  -4% -    4%) 0.750
                         MedTerm     1549.63      (2.5%)     1547.41      (3.2%)   -0.1% (  -5% -    5%) 0.874
                     LowSpanNear       13.72      (3.3%)       13.71      (2.8%)   -0.1% (  -6% -    6%) 0.944
                     AndHighHigh       73.67      (3.5%)       73.62      (2.9%)   -0.1% (  -6% -    6%) 0.950
                         LowTerm     2856.45      (3.4%)     2855.10      (4.3%)   -0.0% (  -7% -    7%) 0.969
                     MedSpanNear        5.15      (9.9%)        5.15      (9.1%)   -0.0% ( -17% -   21%) 0.996
                       MedPhrase       25.88      (2.4%)       25.87      (2.4%)   -0.0% (  -4% -    5%) 0.987
                 LowSloppyPhrase       79.38      (3.8%)       79.48      (3.6%)    0.1% (  -7% -    7%) 0.917
                 MedSloppyPhrase       12.25      (3.1%)       12.28      (3.4%)    0.2% (  -6% -    6%) 0.817
                    HighSpanNear        6.20      (4.5%)        6.22      (3.3%)    0.3% (  -7% -    8%) 0.782
                      OrHighHigh       20.94      (3.3%)       21.35      (4.2%)    2.0% (  -5% -    9%) 0.098
{noformat}

There is a modest improvement to OrHighHigh (+2.0%; it is consistently reproducible, so I believe it's not noise) with no significant slowdown of other queries. A small effect is expected since blocks generally have different maximum scores.

However, the (cab_color:y OR cab_color:g) query on the sorted sparse NYC Taxis benchmark goes from 80ms to 2ms.

> WANDScorer could skip more
> --------------------------
>
>                 Key: LUCENE-10121
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10121
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was looking at the NYC Taxis benchmark recently and got puzzled by the fact that the query (cab_color:y OR cab_color:g) ran so slowly: http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#search_bq_qps. This is supposed to be a best-case scenario for WAND: there are only two possible scores for documents, so this query should return instantly in the sorted case.
> After digging I noticed that this is due to the scaling that we do in WANDScorer to avoid floating-point rounding errors: documents can be considered as possible matches according to the scaled scores (which are rounded) while they cannot possibly match according to the actual scores. This is especially visible when many blocks contain a document that has the maximum score across the entire postings list, for instance with any field indexed with indexOptions=DOCS or with constant-scoring queries.
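
The effect described in the issue can be reproduced with a small standalone example. This is only an illustration under assumptions: the scaling factor and the helper names are hypothetical and do not come from WANDScorer, and the minimum competitive score is assumed to be the next float above the score of the hits already collected.

```java
// Hypothetical illustration; names and scaling factor are assumptions.
final class ScaledScoreRounding {

  // Illustrative scaling factor (assumption, chosen only for this example).
  static final double SCALE = 1 << 16;

  // Maximum scores are rounded up when scaled to integers.
  static long scaleMax(float maxScore) {
    return (long) Math.ceil(maxScore * SCALE);
  }

  // The minimum competitive score is rounded down when scaled to integers.
  static long scaleMin(float minCompetitiveScore) {
    return (long) Math.floor(minCompetitiveScore * SCALE);
  }

  public static void main(String[] args) {
    // Every document in the block scores at most 1.0, and the hits collected
    // so far already have that score, so only higher scores are competitive.
    float blockMaxScore = 1.0f;
    float minCompetitiveScore = Math.nextUp(blockMaxScore);

    boolean competitiveWhenScaled = scaleMax(blockMaxScore) >= scaleMin(minCompetitiveScore);
    boolean competitiveWhenExact = blockMaxScore >= minCompetitiveScore;

    System.out.println("scaled scores say competitive: " + competitiveWhenScaled);
    System.out.println("actual scores say competitive: " + competitiveWhenExact);
  }
}
```

With these values both scaled scores round to the same integer, so the scaled comparison keeps the block as a candidate even though the actual scores show that it could be skipped.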


