You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2018/01/23 16:57:00 UTC

[jira] [Commented] (LUCENE-8135) Implement Block-Max WAND

    [ https://issues.apache.org/jira/browse/LUCENE-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336062#comment-16336062 ] 

Adrien Grand commented on LUCENE-8135:
--------------------------------------

Here is a patch that applies on top of the one that is posted on LUCENE-4198:
 - Scorer gets the same methods as ImpactsEnum: advanceShallow and getMaxScore, with the same contract
 - Disjunctions and conjunctions skip over entire blocks when the sum of max scores is not competitive.
 - Disjunctions now use the block max score instead of the global max score, which helps skip over more documents.
 - This is not documented in the paper, but when a minimum score is set, the score is computed on the fly in order to move to the next doc faster. I did this based on the observation that computing a score was often less costly than advancing another clause. This is especially useful due to the fact that on the contrary to term queries, the maximum score of a block is often not collected.
 - Another difference with the paper is the fact that we have information about blocks at multiple levels. This is one of the ideas described in a follow-up paper (see section 4 of http://engineering.nyu.edu/~suel/papers/bmm.pdf) which is based on the observation that you sometimes want to advance by more than ${block-size}.

Results on wikibigall need to be taken carefully as some tasks are either faster or slower depending on which query is run. Typically queries whose terms match lots of documents but the intersection matches few documents are a bit slower because it takes time to get a good lower bound for the score. Baseline is master, and patch is the combination of the patch on LUCENE-4198 and this patch.
 
{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
               OrHighLow      439.34      (2.2%)      402.38      (1.8%)   -8.4% ( -12% -   -4%)
                  Fuzzy1      144.25      (4.3%)      137.08      (3.3%)   -5.0% ( -12% -    2%)
               MedPhrase       66.49      (1.1%)       63.83      (1.8%)   -4.0% (  -6% -   -1%)
                  Fuzzy2      153.94      (7.0%)      147.95      (7.3%)   -3.9% ( -16% -   11%)
   HighTermDayOfYearSort       54.35     (10.5%)       52.97     (12.0%)   -2.5% ( -22% -   22%)
       HighTermMonthSort       91.01      (9.0%)       89.31      (9.2%)   -1.9% ( -18% -   17%)
                 Prefix3      112.97      (7.6%)      111.66      (6.6%)   -1.2% ( -14% -   14%)
             MedSpanNear       46.12      (4.1%)       45.82      (4.5%)   -0.6% (  -8% -    8%)
                 Respell      249.65      (4.4%)      248.13      (4.1%)   -0.6% (  -8% -    8%)
                Wildcard       38.32      (6.9%)       38.09      (7.2%)   -0.6% ( -13% -   14%)
                  IntNRQ       34.26      (4.8%)       34.18      (6.0%)   -0.3% ( -10% -   11%)
             LowSpanNear       12.46      (4.8%)       12.44      (4.9%)   -0.2% (  -9% -   10%)
               LowPhrase       44.89      (1.4%)       44.85      (2.1%)   -0.1% (  -3% -    3%)
         MedSloppyPhrase       13.77      (6.6%)       13.88      (5.8%)    0.8% ( -10% -   14%)
            HighSpanNear        3.58      (7.0%)        3.61      (7.2%)    0.9% ( -12% -   16%)
         LowSloppyPhrase        4.59      (7.7%)        4.65      (6.8%)    1.4% ( -12% -   17%)
        HighSloppyPhrase        8.87      (8.1%)        9.02      (6.9%)    1.7% ( -12% -   18%)
              AndHighLow     1380.92      (3.2%)     1418.90      (3.2%)    2.8% (  -3% -    9%)
              HighPhrase       17.20      (3.3%)       18.70      (4.7%)    8.8% (   0% -   17%)
              OrHighHigh       21.38      (3.3%)       28.96      (4.6%)   35.5% (  26% -   44%)
              AndHighMed      123.15      (1.5%)      177.55      (5.3%)   44.2% (  36% -   51%)
               OrHighMed       42.31      (2.0%)       68.99      (4.0%)   63.0% (  55% -   70%)
             AndHighHigh       25.65      (1.5%)       44.33      (6.9%)   72.8% (  63% -   82%)
                 LowTerm      567.65      (2.2%)     2305.80     (16.4%)  306.2% ( 281% -  332%)
                 MedTerm      284.07      (2.1%)     1837.41     (26.1%)  546.8% ( 508% -  587%)
               AndCommon       56.30      (1.3%)      437.69     (33.5%)  677.5% ( 634% -  721%)
                HighTerm       86.50      (2.1%)      709.65     (38.8%)  720.4% ( 665% -  777%)
                OrCommon       13.95      (3.3%)      223.83     (66.0%) 1504.7% (1389% - 1627%)
{noformat}

For instance I had other runs that made OrHighMed slower. However the slowdown is usually contained and remains in the ~10%.

There are two tasks that I added: AndCommon and OrCommon. Current tasks have been computed rather randomly (I think!) based on the frequencies of terms, but user queries are a bit more likely to query terms that frequently occur together, so I wanted to see the impact on such queries:

{noformat}
OrCommon: united states
OrCommon: have been
OrCommon: see also
OrCommon: united kingdom
OrCommon: new york
AndCommon: +united +states
AndCommon: +have +been
AndCommon: +see +also
AndCommon: +united +kingdom
AndCommon: +new +york
{noformat}

On the contrary to some other tasks like OrHighMed, OrHighLow and AndHighLow, those tasks are consistently MUCH faster with the patch.

> Implement Block-Max WAND
> ------------------------
>
>                 Key: LUCENE-8135
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8135
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> This issue is about building on top of LUCENE-4198 in order to leverage block maximum scores instead of global maximum scores. This is documented in "Faster Top-k Document Retrieval Using Block-Max Indexes" ([http://engineering.nyu.edu/~suel/papers/bmw.pdf)|http://engineering.nyu.edu/~suel/papers/bmw.pdf] and called BMW (Block-Max WAND).
>  
> Using block max scores adds overhead to scorers, but also provides better upper bounds of the scores and is expected to remain efficient in presence of outliers (LUCENE-8087).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org