Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/05/12 17:34:00 UTC

[jira] [Updated] (LUCENE-6458) MultiTermQuery's FILTER rewrite method should support skipping whenever possible

     [ https://issues.apache.org/jira/browse/LUCENE-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6458:
---------------------------------
    Attachment: LUCENE-6458.patch
                wikimedium.10M.nostopwords.tasks

I did some more benchmarking of the change with filters (see attached tasks file) and various thresholds (and a fixed seed):

{noformat}
Threshold = 16
                    Task  QPS baseline    StdDev   QPS patch      StdDev                Pct diff
                     MTQ       24.33      (7.5%)       20.67      (7.3%)  -15.1% ( -27% -    0%)
                  IntNRQ       20.38      (7.3%)       17.85     (11.9%)  -12.4% ( -29% -    7%)
               IntNRQ_50        8.94     (10.1%)        8.67      (8.6%)   -3.0% ( -19% -   17%)
                  MTQ_50        9.05      (7.9%)        8.93      (5.3%)   -1.3% ( -13% -   12%)
               IntNRQ_10       13.72     (12.7%)       13.60     (11.9%)   -0.9% ( -22% -   27%)
                IntNRQ_1       17.53     (17.1%)       17.53     (16.3%)    0.0% ( -28% -   40%)
                  MTQ_10       13.70     (11.2%)       13.89      (8.7%)    1.4% ( -16% -   23%)
                   MTQ_1       19.11     (15.8%)       21.43     (18.0%)   12.1% ( -18% -   54%)

Threshold = 64
                    Task  QPS baseline    StdDev   QPS patch      StdDev                Pct diff
                  IntNRQ       20.53      (6.9%)       16.42      (5.3%)  -20.0% ( -30% -   -8%)
                     MTQ       24.31      (7.3%)       20.34      (6.4%)  -16.3% ( -27% -   -2%)
               IntNRQ_50        8.87      (9.2%)        8.31      (6.5%)   -6.3% ( -20% -   10%)
               IntNRQ_10       13.55     (12.7%)       12.80     (10.2%)   -5.6% ( -25% -   19%)
                IntNRQ_1       17.27     (16.3%)       16.38     (13.1%)   -5.2% ( -29% -   28%)
                  MTQ_50        9.00      (7.6%)        9.02      (4.5%)    0.3% ( -10% -   13%)
                  MTQ_10       13.65     (11.1%)       14.73      (8.2%)    7.9% ( -10% -   30%)
                   MTQ_1       18.95     (15.1%)       25.32     (17.2%)   33.6% (   1% -   77%)

Threshold = 256
                    Task  QPS baseline    StdDev   QPS patch      StdDev                Pct diff
                  IntNRQ       20.43      (9.4%)       12.69      (1.7%)  -37.9% ( -44% -  -29%)
                     MTQ       24.13      (9.3%)       19.32      (5.3%)  -19.9% ( -31% -   -5%)
                IntNRQ_1       17.21     (19.5%)       13.90      (7.7%)  -19.2% ( -38% -    9%)
               IntNRQ_10       13.49     (12.7%)       10.95      (5.7%)  -18.8% ( -33% -    0%)
               IntNRQ_50        8.85     (10.5%)        7.40      (3.8%)  -16.4% ( -27% -   -2%)
                  MTQ_50        8.94      (8.3%)        8.82      (4.4%)   -1.3% ( -12% -   12%)
                  MTQ_10       13.53     (12.6%)       14.64      (5.9%)    8.2% (  -9% -   30%)
                   MTQ_1       18.88     (15.6%)       26.52     (14.2%)   40.5% (   9% -   83%)

Threshold = 1024
                    Task  QPS baseline    StdDev   QPS patch      StdDev                Pct diff
                  IntNRQ       20.40      (7.7%)        6.54      (1.5%)  -67.9% ( -71% -  -63%)
                IntNRQ_1       17.57     (17.2%)        8.27      (2.9%)  -52.9% ( -62% -  -39%)
               IntNRQ_10       13.66     (13.0%)        6.72      (2.4%)  -50.8% ( -58% -  -40%)
               IntNRQ_50        8.96     (10.4%)        5.01      (1.5%)  -44.1% ( -50% -  -35%)
                     MTQ       24.41      (8.2%)       18.07      (4.4%)  -26.0% ( -35% -  -14%)
                  MTQ_50        9.05      (8.1%)        8.65      (3.5%)   -4.5% ( -14% -    7%)
                  MTQ_10       13.60     (11.5%)       14.41      (3.9%)    6.0% (  -8% -   24%)
                   MTQ_1       19.11     (15.6%)       27.32     (12.9%)   43.0% (  12% -   84%)
{noformat}

Rewriting to a BooleanQuery never helps when there is no filter. However, one thing the benchmark doesn't capture is that BooleanQuery at least does not allocate O(maxDoc) memory, which can matter for large datasets.
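
To give a rough sense of the O(maxDoc) cost, here is a back-of-the-envelope sketch in plain Java (not Lucene code). It assumes one bit per document stored in 64-bit words, and uses 10 million documents only because the tasks file name suggests an index of that size; the class and method names are made up for illustration.

```java
final class BitSetMemorySketch {

    /**
     * Approximate bytes needed for a bit set with one bit per document,
     * rounded up to whole 64-bit words. Object headers are ignored.
     */
    static long bitSetBytes(int maxDoc) {
        long words = ((long) maxDoc + 63) / 64;
        return words * Long.BYTES;
    }

    public static void main(String[] args) {
        // Assumed index size of 10 million documents.
        System.out.println(bitSetBytes(10_000_000)); // prints 1250000
    }
}
```

So under these assumptions each such query allocates about 1.25 MB regardless of how few documents match, and the cost grows linearly with the index size.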

When there are filters, it's more complicated: it depends on the density of the filter, on the number of terms, and apparently also on how the frequencies of the different terms compare (this is my current theory for why WildcardQuery performs better than NRQ).

Net/net, I think this validates that 64 would be a good threshold for rewriting, with minimal slowdown when filters are dense and interesting speedups when filters are sparse?
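
For readers who want the idea in code, here is a minimal sketch of the two execution strategies in plain Java. It is not the patch and uses no Lucene APIs: postings and the filter are modeled as sorted int arrays, the names are hypothetical, and it assumes the threshold is compared against the number of matching terms.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class FilterRewriteSketch {

    // Assumed to be a limit on the number of matching terms.
    static final int THRESHOLD = 64;

    /** Many terms: consume every postings list fully, then intersect with the filter. */
    static int countViaBitSet(List<int[]> postings, int[] filter, int maxDoc) {
        BitSet bits = new BitSet(maxDoc); // O(maxDoc) memory
        for (int[] p : postings) {
            for (int doc : p) {
                bits.set(doc);
            }
        }
        int count = 0;
        for (int doc : filter) {
            if (bits.get(doc)) count++;
        }
        return count;
    }

    /** Few terms: the filter leads and each postings list skips ahead to the candidate. */
    static int countViaDisjunction(List<int[]> postings, int[] filter) {
        int[] pos = new int[postings.size()];
        int count = 0;
        for (int doc : filter) {
            boolean match = false;
            for (int i = 0; i < pos.length; i++) {
                int[] p = postings.get(i);
                pos[i] = advance(p, pos[i], doc);
                if (pos[i] < p.length && p[pos[i]] == doc) match = true;
            }
            if (match) count++;
        }
        return count;
    }

    /** Index of the first entry >= target at or after 'from' (binary search as a stand-in for skipping). */
    static int advance(int[] p, int from, int target) {
        int idx = Arrays.binarySearch(p, from, p.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    /** Picks a strategy from the number of matching terms. */
    static int count(List<int[]> postings, int[] filter, int maxDoc) {
        return postings.size() <= THRESHOLD
                ? countViaDisjunction(postings, filter)
                : countViaBitSet(postings, filter, maxDoc);
    }

    public static void main(String[] args) {
        List<int[]> postings = List.of(new int[] {1, 5, 9}, new int[] {2, 5, 7});
        int[] filter = {5, 6, 7, 9};
        System.out.println(count(postings, filter, 10)); // prints 3
    }
}
```

With a sparse filter the disjunction path touches only a few entries per postings list, while the bit set path always reads every posting of every term. With a dense filter or many terms the per-candidate work of the disjunction path adds up, which is in line with the slowdowns seen at the larger thresholds.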

> MultiTermQuery's FILTER rewrite method should support skipping whenever possible
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-6458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6458
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6458.patch, LUCENE-6458.patch, wikimedium.10M.nostopwords.tasks
>
>
> Today MultiTermQuery's FILTER rewrite always builds a bit set from all matching terms. This means that we need to consume the entire postings lists of all matching terms. Instead we should try to execute like regular disjunctions when there are few terms.


