You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/10 19:45:00 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

gsmiller opened a new pull request, #12073:
URL: https://github.com/apache/lucene/pull/12073

   ### Description
   
   When the exclusion scorer in `ReqExclScorer` has no second-phase check, we can move the exclusion checking into the first phase fairly easily. This intuitively seems like the right thing to do, and `luceneutil#wikimedium10m` appears to confirm the performance is a little better for relevant negation tasks (`OrNotHighHigh`, `OrHighNotMed`, `OrHighNotLow`, `OrNotHighMed`, and `OrHighNotHigh`). 
   
   Full benchmark results for `wikimedium10m`:
   ```
                               TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
                        AndHighHigh       50.29      (4.8%)       49.54      (5.1%)   -1.5% ( -10% -    8%) 0.333
                            MedTerm      521.14      (4.5%)      513.29      (4.0%)   -1.5% (  -9% -    7%) 0.263
                           HighTerm      539.31      (4.3%)      532.46      (3.9%)   -1.3% (  -9% -    7%) 0.324
              BrowseMonthTaxoFacets       32.65      (3.8%)       32.24      (7.1%)   -1.3% ( -11% -    9%) 0.484
                    MedSloppyPhrase      113.51      (2.5%)      112.23      (3.0%)   -1.1% (  -6% -    4%) 0.200
                   HighSloppyPhrase       23.06      (3.2%)       22.81      (3.6%)   -1.1% (  -7% -    5%) 0.318
                  HighTermMonthSort     2860.46      (2.8%)     2831.28      (3.4%)   -1.0% (  -7% -    5%) 0.301
                    LowSloppyPhrase       29.08      (1.9%)       28.80      (2.1%)   -1.0% (  -4% -    3%) 0.127
                         AndHighMed      245.19      (2.7%)      242.90      (3.4%)   -0.9% (  -6% -    5%) 0.334
                            LowTerm      778.95      (4.6%)      772.26      (3.2%)   -0.9% (  -8% -    7%) 0.491
              BrowseMonthSSDVFacets       14.09      (1.8%)       13.98      (1.8%)   -0.8% (  -4% -    2%) 0.187
                             IntNRQ      215.47      (6.6%)      214.33      (7.4%)   -0.5% ( -13% -   14%) 0.811
                          OrHighMed       98.72      (3.0%)       98.31      (2.5%)   -0.4% (  -5% -    5%) 0.629
                          LowPhrase       47.03      (2.9%)       46.86      (3.2%)   -0.4% (  -6% -    5%) 0.711
                         HighPhrase       62.54      (4.2%)       62.34      (4.8%)   -0.3% (  -8% -    9%) 0.827
                            Prefix3      214.70      (3.4%)      214.10      (2.9%)   -0.3% (  -6% -    6%) 0.780
                          MedPhrase       36.02      (3.1%)       35.93      (3.6%)   -0.2% (  -6% -    6%) 0.829
                           Wildcard      116.23      (3.4%)      115.99      (2.6%)   -0.2% (  -6% -    6%) 0.826
        BrowseRandomLabelSSDVFacets       10.21      (8.2%)       10.19      (8.3%)   -0.2% ( -15% -   17%) 0.943
                         OrHighHigh       34.57      (3.2%)       34.59      (3.8%)    0.1% (  -6% -    7%) 0.960
                         AndHighLow     1255.64      (2.2%)     1256.69      (2.1%)    0.1% (  -4% -    4%) 0.904
                            Respell       46.56      (1.0%)       46.61      (1.1%)    0.1% (  -2% -    2%) 0.795
                MedIntervalsOrdered        6.62      (2.7%)        6.62      (2.8%)    0.1% (  -5% -    5%) 0.901
            AndHighMedDayTaxoFacets       82.57      (1.7%)       82.67      (1.5%)    0.1% (  -3% -    3%) 0.825
           AndHighHighDayTaxoFacets        9.35      (2.0%)        9.36      (2.2%)    0.1% (  -3% -    4%) 0.849
                LowIntervalsOrdered       27.77      (5.0%)       27.80      (5.4%)    0.1% (  -9% -   11%) 0.935
        BrowseRandomLabelTaxoFacets       23.47      (2.8%)       23.50      (2.8%)    0.1% (  -5% -    5%) 0.874
                             Fuzzy2       75.07      (1.0%)       75.21      (1.1%)    0.2% (  -1% -    2%) 0.587
                             Fuzzy1       79.08      (0.7%)       79.22      (0.8%)    0.2% (  -1% -    1%) 0.431
                        MedSpanNear        9.07      (1.9%)        9.08      (2.2%)    0.2% (  -3% -    4%) 0.754
                       HighSpanNear        5.25      (1.5%)        5.26      (1.4%)    0.3% (  -2% -    3%) 0.588
               HighIntervalsOrdered       11.17      (4.5%)       11.20      (4.3%)    0.3% (  -8% -    9%) 0.821
               BrowseDateTaxoFacets       31.80      (4.2%)       31.93      (3.7%)    0.4% (  -7% -    8%) 0.727
          BrowseDayOfYearTaxoFacets       32.02      (4.4%)       32.19      (4.0%)    0.5% (  -7% -    9%) 0.689
               MedTermDayTaxoFacets       42.99      (3.8%)       43.24      (3.3%)    0.6% (  -6% -    7%) 0.605
                  HighTermTitleSort      112.14      (3.8%)      112.79      (4.3%)    0.6% (  -7% -    9%) 0.652
                        LowSpanNear      198.11      (1.7%)      199.54      (1.6%)    0.7% (  -2% -    4%) 0.172
                       OrNotHighLow      722.41      (2.9%)      728.48      (2.8%)    0.8% (  -4% -    6%) 0.351
                          OrHighLow      370.98      (2.9%)      374.17      (3.2%)    0.9% (  -5% -    7%) 0.372
                         TermDTSort       73.90      (3.1%)       74.56      (4.1%)    0.9% (  -6% -    8%) 0.432
             OrHighMedDayTaxoFacets        4.23      (4.3%)        4.27      (3.5%)    0.9% (  -6% -    9%) 0.460
                           PKLookup      180.56      (1.5%)      182.54      (2.4%)    1.1% (  -2% -    5%) 0.080
               HighTermTitleBDVSort       11.73      (5.1%)       11.87      (6.3%)    1.2% (  -9% -   13%) 0.512
              HighTermDayOfYearSort      260.53      (2.9%)      263.83      (2.8%)    1.3% (  -4% -    7%) 0.167
                      OrNotHighHigh      396.17      (3.8%)      406.17      (3.9%)    2.5% (  -4% -   10%) 0.038
               BrowseDateSSDVFacets        3.73      (5.4%)        3.83      (9.2%)    2.6% ( -11% -   18%) 0.280
                       OrHighNotMed      355.05      (4.5%)      364.37      (5.2%)    2.6% (  -6% -   12%) 0.088
          BrowseDayOfYearSSDVFacets       13.03      (2.1%)       13.40     (11.5%)    2.8% ( -10% -   16%) 0.282
                       OrHighNotLow      373.18      (5.1%)      383.76      (5.7%)    2.8% (  -7% -   14%) 0.098
                       OrNotHighMed      365.05      (3.8%)      377.09      (3.4%)    3.3% (  -3% -   10%) 0.004
                      OrHighNotHigh      200.46      (4.2%)      208.91      (4.3%)    4.2% (  -4% -   13%) 0.002
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1379399579

   @jpountz:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378122618

   But the benchmark checks only the best case scenario here... TermQuery? What about more expensive excl? e.g. a wildcard or something (with the ConstantScoreScorer that has no two-phase)
   
   The optimization seems risky to me and just not intuitive: personally I think of approximation(ReqExcl) = Req. Why do additional work in the first phase, unless we really need to: it could slow everything down?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

jpountz commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378939790

   Do these queries actually use `ReqExclScorer`? I would have expected them to use `ReqExclBulkScorer`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378947355

   I cant keep up with the conditions where bulk is unsuitable. But just generally, my concern is to implement a best-case 1-2% speedup and create a large worst-case slowdown somewhere else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1381092287

   Moved this into "draft" state until I'm able to come back and figure out why there was a benchmark improvement with this change, given the feedback that `ReqExclBulkScorer` would be expected to kick in. I've gotten busy with some other work, but plan to come back and dig into this. In the meantime though, moving to "draft."


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller closed pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller closed pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check
URL: https://github.com/apache/lucene/pull/12073


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378128948

   and maybe wildcard query isn't great example either: confirming matches is gonna be fast since its a bitset. 
   
   But I am still suspicious of the logic here, I don't think we can infer anything about performance from "twoPhaseIterator() == null"... and it seems that is what we are doing here. We can't infer that matches are fast just because nobody `@Override` the method :) It could just be a query that isn't well optimized (e.g. custom user query or something).  There are probably some of these in lucene that are an interesting worst-case here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

jpountz commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1378951278

   Sorry my comment was not targeted at your comment but at understanding why we're seeing a speedup with this change, since I wouldn't expect this scorer to be used (though my understanding could be outdated).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12073: Move ReqExclScorer exclusion checking into first-phase when the exclusion Scorer has no second-phase check

Posted by GitBox <gi...@apache.org>.

gsmiller commented on PR #12073:
URL: https://github.com/apache/lucene/pull/12073#issuecomment-1379404470

@jpountz:
> understanding why we're seeing a speedup with this change
Good question. I was not familiar with `ReqExclBulkScorer`, but after taking a bit of a look, I have the same question now. I'll have to spend more time looking at these benchmark tasks to understand what might be going on here. Thank you for pointing this out.

@rmuir:
Thanks for the feedback! You raise some good questions here. My intuition was that reducing the size of the first phase match set would be beneficial in general. The reasoning being that, 1) it reduces the number of "candidates" that other potential conjunction clauses need to evaluate, and 2) it reduces the number of candidates that need to go to a second-phase (particularly if there are 2nd phase checks occurring _before_ the exclusion check). But I hear your point about term queries being a best-case scenario here. If the exclusion clauses are more complex, it's more difficult to reason about.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org