Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/04/29 13:35:05 UTC

[jira] [Updated] (LUCENE-6458) MultiTermQuery's FILTER rewrite method should support skipping whenever possible

     [ https://issues.apache.org/jira/browse/LUCENE-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6458:
---------------------------------
    Attachment: LUCENE-6458.patch

Here is a patch. It is quite similar to the old "auto" rewrite, except that it rewrites per segment and consumes the filtered terms enum only once. Queries are executed as regular disjunctions when there are 50 or fewer matching terms.
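To make the decision rule concrete, here is a minimal sketch of the idea in plain Java. It is not the code from the attached patch: the class and method names are hypothetical, and an in-memory map of term to sorted doc IDs stands in for a segment's real terms enum and postings. It only illustrates consuming the matching terms once and switching to a bit set as soon as a 51st term shows up.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene or the patch.
class FilterRewriteSketch {
    // Threshold quoted in the comment above: up to 50 terms run as a disjunction.
    static final int TERM_THRESHOLD = 50;

    // Exactly one of the two fields is non-null.
    static final class Rewritten {
        final List<String> disjunctionTerms; // few terms: execute as a regular disjunction
        final BitSet docs;                   // many terms: pre-computed set of matching docs

        Rewritten(List<String> disjunctionTerms, BitSet docs) {
            this.disjunctionTerms = disjunctionTerms;
            this.docs = docs;
        }
    }

    // Would be called once per segment. The term iterator is consumed exactly once.
    static Rewritten rewrite(Iterator<String> matchingTerms,
                             Map<String, int[]> postings,
                             int maxDoc) {
        List<String> buffered = new ArrayList<>();
        while (matchingTerms.hasNext()) {
            String term = matchingTerms.next();
            if (buffered.size() < TERM_THRESHOLD) {
                buffered.add(term); // still cheap enough to keep as individual clauses
                continue;
            }
            // More than TERM_THRESHOLD terms: fall back to a bit set, reusing the
            // terms already buffered instead of iterating the terms a second time.
            BitSet bits = new BitSet(maxDoc);
            for (String b : buffered) {
                addAll(bits, postings.get(b));
            }
            addAll(bits, postings.get(term));
            while (matchingTerms.hasNext()) {
                addAll(bits, postings.get(matchingTerms.next()));
            }
            return new Rewritten(null, bits);
        }
        return new Rewritten(buffered, null);
    }

    private static void addAll(BitSet bits, int[] docIds) {
        for (int doc : docIds) {
            bits.set(doc);
        }
    }
}
```

In this sketch the bit-set branch reads every posting of every matching term up front, while the disjunction branch defers all postings access to query execution.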

{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                 Prefix3      113.17      (1.7%)       88.55      (2.7%)  -21.8% ( -25% -  -17%)
                Wildcard       37.43      (2.0%)       36.26      (3.2%)   -3.1% (  -8% -    2%)
            HighSpanNear        4.30      (2.6%)        4.24      (4.0%)   -1.6% (  -7% -    5%)
            OrHighNotLow       71.52      (1.5%)       70.51      (3.1%)   -1.4% (  -5% -    3%)
        HighSloppyPhrase       20.60      (6.3%)       20.34      (7.6%)   -1.3% ( -14% -   13%)
            OrHighNotMed       96.14      (2.0%)       95.11      (2.8%)   -1.1% (  -5% -    3%)
               MedPhrase       23.49      (1.8%)       23.30      (3.5%)   -0.8% (  -6% -    4%)
                 Respell       62.25      (8.9%)       62.01      (7.4%)   -0.4% ( -15% -   17%)
             AndHighHigh       52.43      (0.7%)       52.27      (1.1%)   -0.3% (  -2% -    1%)
           OrNotHighHigh       26.08      (3.5%)       26.02      (1.0%)   -0.2% (  -4% -    4%)
           OrHighNotHigh       61.96      (2.0%)       61.85      (2.1%)   -0.2% (  -4% -    4%)
                  IntNRQ        8.03      (3.1%)        8.02      (2.6%)   -0.2% (  -5% -    5%)
                 LowTerm      783.62      (4.9%)      783.25      (4.5%)   -0.0% (  -9% -    9%)
             MedSpanNear       18.77      (1.9%)       18.76      (3.6%)   -0.0% (  -5% -    5%)
             LowSpanNear       14.49      (2.5%)       14.49      (2.6%)   -0.0% (  -4% -    5%)
                 MedTerm      237.81      (2.1%)      237.76      (3.0%)   -0.0% (  -4% -    5%)
                PKLookup      266.15      (2.5%)      266.38      (2.5%)    0.1% (  -4% -    5%)
               OrHighMed       50.61      (6.0%)       50.68      (6.1%)    0.1% ( -11% -   13%)
                  Fuzzy2       19.87      (4.4%)       19.92      (7.8%)    0.2% ( -11% -   12%)
            OrNotHighMed       90.03      (1.1%)       90.25      (0.8%)    0.2% (  -1% -    2%)
              HighPhrase       15.56      (2.0%)       15.61      (2.7%)    0.3% (  -4% -    5%)
         MedSloppyPhrase      252.97      (5.2%)      253.93      (4.3%)    0.4% (  -8% -   10%)
               LowPhrase        8.16      (1.7%)        8.21      (1.9%)    0.6% (  -2% -    4%)
                HighTerm      115.17      (2.4%)      116.05      (2.7%)    0.8% (  -4% -    6%)
              OrHighHigh       25.19      (5.7%)       25.45      (6.4%)    1.0% ( -10% -   13%)
               OrHighLow       42.12      (7.5%)       42.60      (6.9%)    1.1% ( -12% -   16%)
         LowSloppyPhrase      129.20      (1.6%)      130.68      (2.0%)    1.2% (  -2% -    4%)
              AndHighMed      231.64      (1.3%)      235.28      (2.1%)    1.6% (  -1% -    4%)
              AndHighLow      733.51      (3.9%)      751.08      (3.5%)    2.4% (  -4% -   10%)
                  Fuzzy1       85.42     (17.0%)       91.04      (5.9%)    6.6% ( -13% -   35%)
            OrNotHighLow      893.55      (2.9%)      962.35      (4.6%)    7.7% (   0% -   15%)
{noformat}

I was hoping it would kick in for numeric range queries, but unfortunately they often need to match hundreds of terms. I'm wondering if it would be different for auto-prefix.

Prefix3 and Wildcard are a bit slower because they now actually get executed as regular disjunctions. I think the slowdown is fair given that the patch also requires less memory and provides true skipping support (which the benchmark doesn't use).
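For readers unfamiliar with what skipping buys, here is a rough sketch in plain Java. The class name is hypothetical, sorted int arrays stand in for postings lists, and a binary search stands in for skip data. It is not how Lucene implements disjunctions; it only shows that advancing to a target lets each clause jump past doc IDs below the target instead of reading them one by one.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's disjunction scorer.
class DisjunctionSkipSketch {
    // Sentinel for exhaustion (same value Lucene uses for NO_MORE_DOCS).
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[][] postings; // one sorted doc ID array per term
    private final int[] pos;        // current position within each array

    DisjunctionSkipSketch(int[][] sortedPostings) {
        this.postings = sortedPostings;
        this.pos = new int[sortedPostings.length];
    }

    // Returns the smallest doc ID >= target across all terms, or NO_MORE_DOCS.
    int advance(int target) {
        int min = NO_MORE_DOCS;
        for (int i = 0; i < postings.length; i++) {
            // Jump straight to the first entry >= target; entries below it are never visited.
            int idx = Arrays.binarySearch(postings[i], pos[i], postings[i].length, target);
            if (idx < 0) {
                idx = -idx - 1; // convert "not found" result to the insertion point
            }
            pos[i] = idx;
            if (idx < postings[i].length) {
                min = Math.min(min, postings[i][idx]);
            }
        }
        return min;
    }
}
```

A bit set built from all matching terms can also answer such a call quickly, but only after every posting has already been read to populate it, which is the cost the disjunction path avoids when the caller skips a lot.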

> MultiTermQuery's FILTER rewrite method should support skipping whenever possible
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-6458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6458
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6458.patch
>
>
> Today MultiTermQuery's FILTER rewrite always builds a bit set from all matching terms. This means that we need to consume the entire postings lists of all matching terms. Instead, we should try to execute like regular disjunctions when there are few terms.


