Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/04/29 13:35:06 UTC
[jira] [Comment Edited] (LUCENE-6458) MultiTermQuery's FILTER rewrite method should support skipping whenever possible

    [ https://issues.apache.org/jira/browse/LUCENE-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519160#comment-14519160 ] 

Adrien Grand edited comment on LUCENE-6458 at 4/29/15 11:34 AM:
----------------------------------------------------------------

Here is a patch. It is quite similar to the old "auto" rewrite, except that it rewrites per segment and consumes the filtered terms enum only once. Queries are executed as regular disjunctions when there are 50 matching terms or fewer.

{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                Wildcard       76.93      (1.7%)       66.55      (4.1%)  -13.5% ( -18% -   -7%)
                 Prefix3       99.69      (1.8%)       88.80      (2.3%)  -10.9% ( -14% -   -6%)
               OrHighMed       76.77      (3.7%)       76.26      (3.7%)   -0.7% (  -7% -    6%)
           OrHighNotHigh       37.88      (1.7%)       37.73      (2.2%)   -0.4% (  -4% -    3%)
                 MedTerm      306.74      (1.4%)      305.54      (1.5%)   -0.4% (  -3% -    2%)
              OrHighHigh       36.17      (4.5%)       36.05      (4.0%)   -0.3% (  -8% -    8%)
                HighTerm      120.67      (1.2%)      120.37      (1.7%)   -0.2% (  -3% -    2%)
         MedSloppyPhrase       36.30      (2.9%)       36.25      (2.8%)   -0.1% (  -5% -    5%)
                  IntNRQ        8.64      (2.9%)        8.63      (2.6%)   -0.1% (  -5% -    5%)
             LowSpanNear       70.11      (1.8%)       70.13      (2.2%)    0.0% (  -3% -    4%)
            HighSpanNear       17.55      (2.8%)       17.56      (3.0%)    0.1% (  -5% -    6%)
            OrHighNotMed       81.45      (1.8%)       81.51      (2.2%)    0.1% (  -3% -    4%)
               LowPhrase       14.47      (2.7%)       14.50      (3.0%)    0.2% (  -5% -    6%)
             MedSpanNear      120.55      (2.2%)      120.86      (2.0%)    0.3% (  -3% -    4%)
             AndHighHigh       58.08      (2.5%)       58.24      (2.6%)    0.3% (  -4% -    5%)
         LowSloppyPhrase       62.42      (4.3%)       62.60      (4.4%)    0.3% (  -8% -    9%)
            OrHighNotLow       76.06      (1.9%)       76.36      (2.4%)    0.4% (  -3% -    4%)
                 Respell       72.86      (3.9%)       73.17      (2.9%)    0.4% (  -6% -    7%)
           OrNotHighHigh       50.07      (1.5%)       50.30      (1.2%)    0.5% (  -2% -    3%)
        HighSloppyPhrase       24.92      (6.4%)       25.05      (6.5%)    0.5% ( -11% -   14%)
               OrHighLow       68.75      (4.6%)       69.17      (4.1%)    0.6% (  -7% -    9%)
              HighPhrase       20.89      (2.5%)       21.04      (1.8%)    0.7% (  -3% -    5%)
            OrNotHighMed      179.02      (1.9%)      180.37      (1.4%)    0.8% (  -2% -    4%)
                PKLookup      263.21      (2.8%)      265.42      (3.0%)    0.8% (  -4% -    6%)
               MedPhrase       34.60      (3.6%)       34.94      (3.4%)    1.0% (  -5% -    8%)
                 LowTerm      780.71      (3.2%)      790.04      (4.2%)    1.2% (  -5% -    8%)
            OrNotHighLow     1459.46      (3.5%)     1480.76      (5.0%)    1.5% (  -6% -   10%)
              AndHighMed      255.15      (2.6%)      258.93      (2.4%)    1.5% (  -3% -    6%)
                  Fuzzy1       77.69      (8.9%)       79.12      (7.7%)    1.8% ( -13% -   20%)
              AndHighLow      961.32      (3.9%)      980.23      (3.5%)    2.0% (  -5% -    9%)
                  Fuzzy2       24.48      (7.9%)       25.19      (7.4%)    2.9% ( -11% -   19%)
{noformat}

I was hoping it would kick in for numeric range queries but unfortunately they often need to match hundreds of terms. I'm wondering if it would be different for auto-prefix.

Prefix3 and Wildcard are a bit slower because they are actually executed as regular disjunctions. I think the slowdown is fair given that the patch also requires less memory and provides true skipping support (which the benchmark doesn't exercise).
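
To make the idea concrete, here is a minimal, self-contained sketch of the decision described above. It is not the code from the patch and does not use Lucene's API: the class FilterRewriteSketch, the Rewritten holder, and the rewrite and advance helpers are hypothetical names, postings lists are modelled as sorted int arrays, and a binary search stands in for real skip lists. Only the threshold of 50 terms comes from the comment itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;

/** Toy model of a per-segment FILTER rewrite. Illustrative only, not Lucene code. */
final class FilterRewriteSketch {

    /** Threshold from the comment: 50 matching terms or fewer run as a disjunction. */
    static final int MAX_DISJUNCTION_TERMS = 50;

    /** Local sentinel meaning "no further matching document". */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical result holder: exactly one of the two fields is non-null. */
    static final class Rewritten {
        final List<int[]> disjunction; // postings of the few matching terms, read lazily
        final BitSet bitSet;           // every matching doc ID, collected eagerly

        Rewritten(List<int[]> disjunction, BitSet bitSet) {
            this.disjunction = disjunction;
            this.bitSet = bitSet;
        }
    }

    /**
     * Walks the matching terms of one segment exactly once. Each element of the
     * iterator is the sorted postings list (doc IDs) of one matching term.
     */
    static Rewritten rewrite(Iterator<int[]> matchingPostings, int maxDoc) {
        List<int[]> buffered = new ArrayList<>();
        while (matchingPostings.hasNext() && buffered.size() < MAX_DISJUNCTION_TERMS) {
            buffered.add(matchingPostings.next());
        }
        if (!matchingPostings.hasNext()) {
            // Few terms: keep the postings untouched so they can be skipped over later.
            return new Rewritten(buffered, null);
        }
        // Too many terms: fall back to consuming every postings list into a bit set.
        BitSet bits = new BitSet(maxDoc);
        for (int[] postings : buffered) {
            for (int doc : postings) {
                bits.set(doc);
            }
        }
        while (matchingPostings.hasNext()) {
            for (int doc : matchingPostings.next()) {
                bits.set(doc);
            }
        }
        return new Rewritten(null, bits);
    }

    /** First doc ID >= target in any postings list of the disjunction, else NO_MORE_DOCS. */
    static int advance(List<int[]> disjunction, int target) {
        int best = NO_MORE_DOCS;
        for (int[] postings : disjunction) {
            int idx = Arrays.binarySearch(postings, target);
            if (idx < 0) {
                idx = -idx - 1; // insertion point: first entry greater than target
            }
            if (idx < postings.length) {
                best = Math.min(best, postings[idx]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<int[]> few = Arrays.asList(new int[] {1, 7, 42}, new int[] {3, 7, 90});
        Rewritten r = rewrite(few.iterator(), 100);
        System.out.println("disjunction used: " + (r.disjunction != null));
        System.out.println("advance(10) = " + advance(r.disjunction, 10));
    }
}
```

In this toy model the disjunction branch never reads a postings entry it is not asked for, which is where the memory saving and the skipping support come from. The bit set branch has to read every entry up front, which is what the FILTER rewrite does today for all queries.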


was (Author: jpountz):
Here is a patch. It is quite similar to the old "auto" rewrite, except that it rewrites per segment and consumes the filtered terms enum only once. Queries are executed as regular disjunctions when there are 50 matching terms or fewer.

{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                 Prefix3      113.17      (1.7%)       88.55      (2.7%)  -21.8% ( -25% -  -17%)
                Wildcard       37.43      (2.0%)       36.26      (3.2%)   -3.1% (  -8% -    2%)
            HighSpanNear        4.30      (2.6%)        4.24      (4.0%)   -1.6% (  -7% -    5%)
            OrHighNotLow       71.52      (1.5%)       70.51      (3.1%)   -1.4% (  -5% -    3%)
        HighSloppyPhrase       20.60      (6.3%)       20.34      (7.6%)   -1.3% ( -14% -   13%)
            OrHighNotMed       96.14      (2.0%)       95.11      (2.8%)   -1.1% (  -5% -    3%)
               MedPhrase       23.49      (1.8%)       23.30      (3.5%)   -0.8% (  -6% -    4%)
                 Respell       62.25      (8.9%)       62.01      (7.4%)   -0.4% ( -15% -   17%)
             AndHighHigh       52.43      (0.7%)       52.27      (1.1%)   -0.3% (  -2% -    1%)
           OrNotHighHigh       26.08      (3.5%)       26.02      (1.0%)   -0.2% (  -4% -    4%)
           OrHighNotHigh       61.96      (2.0%)       61.85      (2.1%)   -0.2% (  -4% -    4%)
                  IntNRQ        8.03      (3.1%)        8.02      (2.6%)   -0.2% (  -5% -    5%)
                 LowTerm      783.62      (4.9%)      783.25      (4.5%)   -0.0% (  -9% -    9%)
             MedSpanNear       18.77      (1.9%)       18.76      (3.6%)   -0.0% (  -5% -    5%)
             LowSpanNear       14.49      (2.5%)       14.49      (2.6%)   -0.0% (  -4% -    5%)
                 MedTerm      237.81      (2.1%)      237.76      (3.0%)   -0.0% (  -4% -    5%)
                PKLookup      266.15      (2.5%)      266.38      (2.5%)    0.1% (  -4% -    5%)
               OrHighMed       50.61      (6.0%)       50.68      (6.1%)    0.1% ( -11% -   13%)
                  Fuzzy2       19.87      (4.4%)       19.92      (7.8%)    0.2% ( -11% -   12%)
            OrNotHighMed       90.03      (1.1%)       90.25      (0.8%)    0.2% (  -1% -    2%)
              HighPhrase       15.56      (2.0%)       15.61      (2.7%)    0.3% (  -4% -    5%)
         MedSloppyPhrase      252.97      (5.2%)      253.93      (4.3%)    0.4% (  -8% -   10%)
               LowPhrase        8.16      (1.7%)        8.21      (1.9%)    0.6% (  -2% -    4%)
                HighTerm      115.17      (2.4%)      116.05      (2.7%)    0.8% (  -4% -    6%)
              OrHighHigh       25.19      (5.7%)       25.45      (6.4%)    1.0% ( -10% -   13%)
               OrHighLow       42.12      (7.5%)       42.60      (6.9%)    1.1% ( -12% -   16%)
         LowSloppyPhrase      129.20      (1.6%)      130.68      (2.0%)    1.2% (  -2% -    4%)
              AndHighMed      231.64      (1.3%)      235.28      (2.1%)    1.6% (  -1% -    4%)
              AndHighLow      733.51      (3.9%)      751.08      (3.5%)    2.4% (  -4% -   10%)
                  Fuzzy1       85.42     (17.0%)       91.04      (5.9%)    6.6% ( -13% -   35%)
            OrNotHighLow      893.55      (2.9%)      962.35      (4.6%)    7.7% (   0% -   15%)
{noformat}

I was hoping it would kick in for numeric range queries but unfortunately they often need to match hundreds of terms. I'm wondering if it would be different for auto-prefix.

Prefix3 and Wildcard are a bit slower because they are actually executed as regular disjunctions. I think the slowdown is fair given that the patch also requires less memory and provides true skipping support (which the benchmark doesn't exercise).

> MultiTermQuery's FILTER rewrite method should support skipping whenever possible
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-6458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6458
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6458.patch
>
>
> Today MultiTermQuery's FILTER rewrite always builds a bit set from all matching terms. This means that we need to consume the entire postings lists of all matching terms. Instead we should try to execute like regular disjunctions when there are few terms.


