Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2017/10/13 08:02:00 UTC
[jira] [Updated] (LUCENE-7993) Speed up phrase queries when total hit count is not needed

     [ https://issues.apache.org/jira/browse/LUCENE-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7993:
---------------------------------
    Attachment: LUCENE-7993.patch

Here is a patch that applies on top of LUCENE-4100 to show the idea. Luceneutil confirms it brings interesting gains on wikimedium10m:

{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
            OrHighNotLow       88.30      (4.4%)       72.67      (2.4%)  -17.7% ( -23% -  -11%)
            OrHighNotMed       93.18      (3.3%)       86.58      (1.9%)   -7.1% ( -11% -   -1%)
            OrNotHighLow     1386.80      (4.0%)     1289.38      (3.3%)   -7.0% ( -13% -    0%)
           OrHighNotHigh       49.84      (3.2%)       47.59      (1.7%)   -4.5% (  -9% -    0%)
                  Fuzzy2      196.79     (16.6%)      188.44      (7.7%)   -4.2% ( -24% -   24%)
            HighSpanNear       58.01      (2.2%)       56.18      (2.4%)   -3.2% (  -7% -    1%)
            OrNotHighMed      184.60      (1.7%)      178.77      (2.4%)   -3.2% (  -7% -    0%)
              AndHighMed      224.60      (1.9%)      217.95      (2.3%)   -3.0% (  -7% -    1%)
             LowSpanNear      143.79      (2.4%)      139.98      (2.4%)   -2.7% (  -7% -    2%)
                  IntNRQ       19.47      (4.2%)       19.13      (5.0%)   -1.8% ( -10% -    7%)
                 MedTerm      248.95      (2.3%)      244.80      (1.9%)   -1.7% (  -5% -    2%)
                 LowTerm      766.37      (3.6%)      758.11      (3.9%)   -1.1% (  -8% -    6%)
                HighTerm      131.14      (2.5%)      129.74      (2.6%)   -1.1% (  -5% -    4%)
             AndHighHigh       30.70      (2.4%)       30.40      (1.5%)   -1.0% (  -4% -    3%)
           OrNotHighHigh       55.99      (2.7%)       55.50      (1.7%)   -0.9% (  -5% -    3%)
                 Prefix3      105.33      (4.8%)      104.60      (3.6%)   -0.7% (  -8% -    8%)
             MedSpanNear       13.38      (2.3%)       13.30      (2.1%)   -0.6% (  -4% -    3%)
                Wildcard       84.93      (4.8%)       84.59      (3.7%)   -0.4% (  -8% -    8%)
              AndHighLow     1419.89      (3.3%)     1432.43      (2.8%)    0.9% (  -4% -    7%)
         LowSloppyPhrase       38.50      (3.0%)       39.02      (1.7%)    1.3% (  -3% -    6%)
        HighSloppyPhrase       15.85      (4.2%)       16.10      (2.4%)    1.6% (  -4% -    8%)
         MedSloppyPhrase      118.20      (3.8%)      120.36      (2.4%)    1.8% (  -4% -    8%)
                 Respell      272.44      (6.5%)      279.22      (3.5%)    2.5% (  -7% -   13%)
       HighTermMonthSort      226.59      (9.1%)      233.94      (9.1%)    3.2% ( -13% -   23%)
                  Fuzzy1      163.36     (10.6%)      171.95      (8.7%)    5.3% ( -12% -   27%)
               LowPhrase      195.93      (2.2%)      222.77      (2.2%)   13.7% (   9% -   18%)
              OrHighHigh       34.58      (6.4%)       45.87      (6.8%)   32.6% (  18% -   49%)
   HighTermDayOfYearSort       65.42      (6.6%)       87.68     (12.5%)   34.0% (  14% -   56%)
               MedPhrase       40.05      (2.0%)       59.16      (2.3%)   47.7% (  42% -   53%)
               OrHighMed       41.35      (6.0%)       64.85      (7.3%)   56.8% (  41% -   74%)
              HighPhrase       22.51      (3.8%)       39.33      (4.0%)   74.8% (  64% -   85%)
               OrHighLow       61.15      (3.2%)      629.98     (41.3%)  930.3% ( 858% - 1007%)
{noformat}

The changes in disjunction performance are due to MAXSCORE. However, {{LowPhrase}} (+13.7%), {{MedPhrase}} (+47.7%) and {{HighPhrase}} (+74.8%) see good speedups too.

> Speed up phrase queries when total hit count is not needed
> ----------------------------------------------------------
>
>                 Key: LUCENE-7993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7993
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7993.patch
>
>
> Follow-up to LUCENE-4100: while thinking about the API that we needed to introduce to support MAXSCORE, I wondered whether the same API could support other optimizations. When running phrase queries, we already have access to the term frequency of each term before we start reading positions, and the frequency of the phrase is bounded by the minimum term frequency of the terms involved. So if the score for that minimum term frequency is not competitive, then the score for the phrase is not competitive either, provided we can assume that the score increases (or stays the same) as the term frequency increases. That sounds like a reasonable requirement for a sane Similarity?
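
To make the idea in the description concrete, here is a minimal sketch of the check, not the code from the attached patch. The class and method names ({{PhraseFreqBound}}, {{maxPhraseFreq}}, {{canSkipPositions}}) are hypothetical and are not Lucene APIs, and the toy {{score}} function only stands in for a Similarity whose score never decreases as the frequency grows.

```java
// Hypothetical illustration only: none of these names exist in Lucene.
final class PhraseFreqBound {

    /**
     * Upper bound on the phrase frequency in one document: the phrase cannot
     * occur more often than the least frequent of its terms.
     */
    static int maxPhraseFreq(int[] termFreqs) {
        int min = Integer.MAX_VALUE;
        for (int freq : termFreqs) {
            min = Math.min(min, freq);
        }
        return min;
    }

    /**
     * Toy scoring function (an assumption, not a Lucene Similarity).
     * It saturates and never decreases when freq increases, which is the
     * property the optimization relies on.
     */
    static float score(float boost, int freq) {
        return boost * freq / (freq + 1.2f);
    }

    /**
     * Returns true when even the best possible phrase frequency cannot reach
     * the minimum competitive score, so positions need not be read at all.
     */
    static boolean canSkipPositions(int[] termFreqs, float boost, float minCompetitiveScore) {
        return score(boost, maxPhraseFreq(termFreqs)) < minCompetitiveScore;
    }

    public static void main(String[] args) {
        int[] termFreqs = {5, 2, 9}; // per-term frequencies in the current document
        System.out.println(canSkipPositions(termFreqs, 2f, 1.5f)); // bound too low: skip
        System.out.println(canSkipPositions(termFreqs, 2f, 1.0f)); // must read positions
    }
}
```

The check only uses term frequencies, which are available before any positions are decoded. If the Similarity could score a lower frequency higher than a higher one, the bound would no longer be safe, which is why the monotonicity assumption matters.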


