Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/02/19 10:56:11 UTC
[jira] [Updated] (LUCENE-6260) Simplify ExactPhraseScorer

     [ https://issues.apache.org/jira/browse/LUCENE-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6260:
---------------------------------
    Attachment: LUCENE-6260.patch

Here is a patch that makes phrase intersection essentially look like ConjunctionDISI, except that it works on positions instead of doc IDs. I ran luceneutil on wikibig1M and the performance loss looks quite small (roughly 1% to 5% on exact phrase queries, within noise elsewhere):

{noformat}
                    Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
              HighPhrase       33.54      (1.3%)       31.72      (1.9%)   -5.4% (  -8% -   -2%)
               LowPhrase       48.76      (1.2%)       47.74      (2.1%)   -2.1% (  -5% -    1%)
            OrNotHighLow     1167.83      (4.0%)     1153.63      (4.4%)   -1.2% (  -9% -    7%)
                  Fuzzy1      112.76     (12.5%)      111.41     (11.9%)   -1.2% ( -22% -   26%)
               MedPhrase      126.21      (1.6%)      124.89      (2.8%)   -1.0% (  -5% -    3%)
                 LowTerm     2361.80      (5.3%)     2338.19      (5.0%)   -1.0% ( -10% -    9%)
              AndHighLow     1053.44      (2.6%)     1043.11      (5.6%)   -1.0% (  -8% -    7%)
            OrHighNotMed      180.00      (1.8%)      179.10      (2.1%)   -0.5% (  -4% -    3%)
            OrHighNotLow      139.58      (2.6%)      139.24      (3.1%)   -0.2% (  -5% -    5%)
                  IntNRQ      126.93      (6.3%)      126.72      (5.5%)   -0.2% ( -11% -   12%)
             AndHighHigh      130.72      (3.1%)      130.58      (3.2%)   -0.1% (  -6% -    6%)
            HighSpanNear       12.64      (1.2%)       12.63      (1.4%)   -0.1% (  -2% -    2%)
                 Prefix3       92.94      (7.8%)       92.92      (7.6%)   -0.0% ( -14% -   16%)
               OrHighMed      155.49     (10.5%)      155.60     (10.0%)    0.1% ( -18% -   22%)
              AndHighMed      181.53      (3.0%)      181.74      (3.0%)    0.1% (  -5% -    6%)
           OrNotHighHigh      137.81      (3.1%)      137.98      (2.2%)    0.1% (  -5% -    5%)
              OrHighHigh      136.52     (10.5%)      136.71      (9.8%)    0.1% ( -18% -   22%)
         MedSloppyPhrase       44.59      (2.8%)       44.67      (3.3%)    0.2% (  -5% -    6%)
           OrHighNotHigh      135.68      (1.6%)      135.93      (1.5%)    0.2% (  -2% -    3%)
                 MedTerm      949.94      (3.1%)      951.88      (2.9%)    0.2% (  -5% -    6%)
             LowSpanNear       26.02      (0.9%)       26.07      (1.3%)    0.2% (  -1% -    2%)
               OrHighLow       97.01     (11.1%)       97.22     (10.6%)    0.2% ( -19% -   24%)
             MedSpanNear       27.98      (1.1%)       28.04      (1.0%)    0.2% (  -1% -    2%)
                PKLookup      407.25      (2.2%)      408.17      (1.9%)    0.2% (  -3% -    4%)
            OrNotHighMed      434.88      (2.8%)      435.99      (2.5%)    0.3% (  -4% -    5%)
                Wildcard      166.20      (4.0%)      166.65      (4.6%)    0.3% (  -8% -    9%)
         LowSloppyPhrase      107.31      (3.7%)      107.65      (3.9%)    0.3% (  -7% -    8%)
        HighSloppyPhrase       13.76      (2.9%)       13.82      (2.9%)    0.4% (  -5% -    6%)
                HighTerm      328.62      (2.3%)      330.24      (2.1%)    0.5% (  -3% -    4%)
                 Respell       67.48      (4.8%)       67.84      (5.5%)    0.5% (  -9% -   11%)
                  Fuzzy2       73.37     (15.2%)       78.35     (13.4%)    6.8% ( -18% -   41%)
{noformat}

One advantage of this approach is that it would help phraseFreq return earlier if scores are not needed and there is a match at the beginning of the document.
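
To make the idea concrete, here is a minimal sketch of a ConjunctionDISI-style leapfrog over positions rather than doc IDs. It is not the code from LUCENE-6260.patch: the class name, the method signature, the use of plain sorted int arrays in place of postings iterators, and the needsFreq flag are all assumptions made for illustration.

```java
/** Hypothetical illustration only; not taken from LUCENE-6260.patch. */
final class PhraseIntersectionSketch {

    /**
     * Counts exact phrase matches within a single document.
     * positions[i] holds the sorted positions of the i-th phrase term, so a
     * match starting at position p requires p + i to be in positions[i].
     * If needsFreq is false, returns 1 as soon as the first match is found.
     */
    static int phraseFreq(int[][] positions, boolean needsFreq) {
        int n = positions.length;
        if (n == 0 || positions[0].length == 0) {
            return 0;
        }
        int[] cursor = new int[n];      // one cursor per term, only moves forward
        int[] lead = positions[0];
        int start = lead[0];            // candidate phrase start from the lead term
        int freq = 0;

        outer:
        while (true) {
            for (int i = 1; i < n; i++) {
                int target = start + i;
                int[] p = positions[i];
                int c = cursor[i];
                // Advance term i to the first position >= target
                // (linear scan here; a real iterator would call nextPosition()).
                while (c < p.length && p[c] < target) {
                    c++;
                }
                cursor[i] = c;
                if (c == p.length) {
                    return freq;        // term i is exhausted: no more matches
                }
                if (p[c] > target) {
                    // Overshoot: leapfrog the lead term to the implied start.
                    int newStart = p[c] - i;
                    int l = cursor[0];
                    while (l < lead.length && lead[l] < newStart) {
                        l++;
                    }
                    cursor[0] = l;
                    if (l == lead.length) {
                        return freq;
                    }
                    start = lead[l];
                    continue outer;     // re-check all terms against the new start
                }
            }
            // Every term is aligned on start + i: one phrase occurrence.
            freq++;
            if (!needsFreq) {
                return freq;            // early exit when scores are not needed
            }
            if (++cursor[0] == lead.length) {
                return freq;
            }
            start = lead[cursor[0]];
        }
    }
}
```

With needsFreq set to false, a document whose first positions already form the phrase returns after a single pass over the terms, which is the early-return behaviour described above.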

> Simplify ExactPhraseScorer
> --------------------------
>
>                 Key: LUCENE-6260
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6260
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6260.patch
>
>
> ExactPhraseScorer tries to intersect positions using windows of 4096 documents. In LUCENE-2410 it was reported that this helped a lot, but I tried again on wikibig with a simpler implementation that advances one position at a time, and the performance difference was only a few percent. I'm guessing that maybe other changes (e.g. the new postings format?) make this behaviour less useful than it used to be?


