Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/12/18 12:04:46 UTC

[jira] [Updated] (LUCENE-6940) Bulk scoring could speed up MUST_NOT clauses

     [ https://issues.apache.org/jira/browse/LUCENE-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6940:
---------------------------------
    Attachment: LUCENE-6940.patch

Here is a quick patch (disclaimer: not commented and not tested) to demonstrate the idea. It uses the new bulk scorer in either of these cases:
 - when there is a single FILTER/MUST clause, no SHOULD clauses, and some MUST_NOT clauses
 - or when there are some SHOULD clauses, no FILTER/MUST clauses, and some MUST_NOT clauses
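The two cases above can be written as a small predicate. This is only a sketch of the conditions as described, not code from the patch: the class name, the method name, and the idea of passing clause counts are assumptions made for illustration.

```java
final class BulkScorerChoice {

    /**
     * Hypothetical helper (not from the patch): returns true when the clause
     * counts match one of the two cases listed above.
     *
     * @param requiredClauses number of FILTER plus MUST clauses
     * @param shouldClauses   number of SHOULD clauses
     * @param mustNotClauses  number of MUST_NOT clauses
     */
    static boolean usesExclusionBulkScorer(int requiredClauses, int shouldClauses, int mustNotClauses) {
        if (mustNotClauses == 0) {
            return false; // nothing to exclude, so neither case applies
        }
        // Case 1: a single FILTER/MUST clause and no SHOULD clauses.
        boolean singleRequired = requiredClauses == 1 && shouldClauses == 0;
        // Case 2: some SHOULD clauses and no FILTER/MUST clauses.
        boolean disjunctionOnly = requiredClauses == 0 && shouldClauses > 0;
        return singleRequired || disjunctionOnly;
    }
}
```

Under this reading, a query such as "do -necessities" (one SHOULD clause, one MUST_NOT clause) would fall into the second case.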

I added some tasks to wikimedium.10M.nostopwords.tasks and ran it through luceneutil. As expected, this seems to yield the largest speedup when the negative clauses match far fewer documents than the positive clauses.

{noformat}
diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks
index 342070c..8991121 100644
--- a/tasks/wikimedium.10M.nostopwords.tasks
+++ b/tasks/wikimedium.10M.nostopwords.tasks
@@ -13361,3 +13361,19 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
 OrHighNotLow: do -necessities # freq=511178 freq=1195
 OrNotHighLow: -had halfback # freq=1246743 freq=1205
 OrHighNotLow: had -halfback # freq=1246743 freq=1205
+AllNotHigh: *:* -been # freq=1041183
+AllNotHigh: *:* -states # freq=1034872
+AllNotHigh: *:* -time # freq=1032071
+AllNotHigh: *:* -when # freq=1027487
+AllNotLow: *:* -factor # freq=37866
+AllNotLow: *:* -migration # freq=37862
+AllNotLow: *:* -maintained # freq=37840
+AllNotLow: *:* -norwegian # freq=37836
+OrHighHighNotLow: several following -factor # freq=436129 freq=416515 freq=37866
+OrHighHighNotLow: publisher end -migration # freq=1289029 freq=526636 freq=37862
+OrHighHighNotLow: 2009 film -maintaine # freq=887702 freq=432758 freq=37840
+OrHighHighNotLow: http known -norwegian # freq=3493581 freq=607158 freq=37836
+OrHighLowNotHigh: 2005 jorgensen -been # freq=835460 freq=837 freq=1041183
+OrHighLowNotHigh: like undivided -states # freq=479390 freq=1512 freq=1034872
+OrHighLowNotHigh: use coy -time # freq=597053 freq=1198 freq=1032071
+OrHighLowNotHigh: been highperformanceengines -when # freq=1041183 freq=1155 freq=1027487
{noformat}

{noformat}
                Task    QPS baseline      StdDev   QPS patch      StdDev                Pct diff
        OrHighLowNotHigh       19.54      (2.6%)       18.59      (4.2%)   -4.9% ( -11% -    1%)
               OrHighMed       34.32      (3.5%)       33.03      (4.8%)   -3.7% ( -11% -    4%)
              OrHighHigh       26.95      (3.7%)       25.97      (4.9%)   -3.6% ( -11% -    5%)
                  Fuzzy2       82.74     (16.0%)       80.29     (16.4%)   -3.0% ( -30% -   35%)
              AndHighLow      502.91      (5.7%)      496.63      (3.0%)   -1.2% (  -9% -    7%)
              AndHighMed      236.44      (2.9%)      234.34      (2.6%)   -0.9% (  -6% -    4%)
            OrNotHighMed      222.75      (2.9%)      220.87      (2.4%)   -0.8% (  -5% -    4%)
                 Respell       60.25      (3.0%)       60.38      (2.7%)    0.2% (  -5% -    6%)
         MedSloppyPhrase       21.73      (2.3%)       21.92      (2.5%)    0.8% (  -3% -    5%)
                  Fuzzy1       57.18      (8.0%)       57.78      (5.8%)    1.1% ( -11% -   16%)
         LowSloppyPhrase       25.96      (1.9%)       26.24      (2.1%)    1.1% (  -2% -    5%)
        HighSloppyPhrase       29.99      (2.5%)       30.37      (2.7%)    1.3% (  -3% -    6%)
               MedPhrase       60.11      (2.8%)       61.15      (3.1%)    1.7% (  -4% -    7%)
             AndHighHigh       32.86      (3.0%)       33.56      (3.0%)    2.1% (  -3% -    8%)
               LowPhrase       59.36      (2.7%)       60.69      (3.2%)    2.2% (  -3% -    8%)
               OrHighLow       78.50      (3.6%)       80.33      (4.3%)    2.3% (  -5% -   10%)
              HighPhrase       17.32      (2.1%)       17.73      (1.9%)    2.4% (  -1% -    6%)
             LowSpanNear       34.90      (2.8%)       35.75      (2.4%)    2.4% (  -2% -    7%)
             MedSpanNear       30.83      (2.9%)       31.59      (2.0%)    2.4% (  -2% -    7%)
            OrNotHighLow      982.57      (4.2%)     1009.18      (2.8%)    2.7% (  -4% -   10%)
            HighSpanNear       10.39      (3.8%)       10.76      (3.7%)    3.5% (  -3% -   11%)
                Wildcard       64.30      (4.2%)       67.27      (5.2%)    4.6% (  -4% -   14%)
                HighTerm      110.90      (5.2%)      117.51      (6.7%)    6.0% (  -5% -   18%)
                 MedTerm      155.42      (5.3%)      165.05      (6.9%)    6.2% (  -5% -   19%)
           OrNotHighHigh       40.19      (1.9%)       42.69      (3.2%)    6.2% (   1% -   11%)
                 Prefix3       87.35      (6.2%)       93.98      (6.9%)    7.6% (  -5% -   22%)
                 LowTerm      574.81      (9.0%)      625.04      (9.6%)    8.7% (  -9% -   30%)
                  IntNRQ       11.95      (9.1%)       13.31     (11.8%)   11.4% (  -8% -   35%)
           OrHighNotHigh       50.66      (2.0%)       56.55      (4.3%)   11.6% (   5% -   18%)
        OrHighHighNotLow       27.15      (3.3%)       33.91      (4.9%)   24.9% (  16% -   34%)
            OrHighNotMed       96.64      (2.7%)      130.20      (8.0%)   34.7% (  23% -   46%)
            OrHighNotLow       42.44      (4.0%)       62.60     (10.7%)   47.5% (  31% -   64%)
              AllNotHigh        6.51      (2.9%)       16.76     (26.8%)  157.4% ( 124% -  192%)
               AllNotLow        7.18      (3.0%)       21.93     (49.2%)  205.3% ( 148% -  265%)
{noformat}

> Bulk scoring could speed up MUST_NOT clauses
> --------------------------------------------
>
>                 Key: LUCENE-6940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6940
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6940.patch
>
>
> Today when you have MUST_NOT clauses, the ReqExclScorer is used and needs to check the excluded clauses on every iteration. I suspect we could speed things up by having a BulkScorer that would advance the excluded clause first and then tell the required clause to bulk score up to the next excluded document.
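The idea in the issue description can be sketched as below. This is a minimal illustration over sorted arrays of doc IDs, not the code from the attached patch: the class and method names are hypothetical, no Lucene classes are used, and "scoring" a document just means adding its ID to a list.

```java
import java.util.ArrayList;
import java.util.List;

final class ReqExclBulkSketch {

    /**
     * Stand-in for bulk scoring the required clause over a window: collects
     * every required doc ID below max, starting at index from, and returns
     * the index of the first doc that was not collected.
     */
    static int scoreRange(int[] required, int from, int max, List<Integer> hits) {
        int i = from;
        while (i < required.length && required[i] < max) {
            hits.add(required[i]);
            i++;
        }
        return i;
    }

    /** Both arrays are assumed to hold doc IDs in strictly increasing order. */
    static List<Integer> scoreExcluding(int[] required, int[] excluded) {
        List<Integer> hits = new ArrayList<>();
        int r = 0; // position in the required clause
        int e = 0; // position in the excluded clause
        while (r < required.length) {
            // Advance the excluded clause first, to the first doc >= the current required doc.
            while (e < excluded.length && excluded[e] < required[r]) {
                e++;
            }
            // Integer.MAX_VALUE acts as "no more excluded docs".
            int upTo = e < excluded.length ? excluded[e] : Integer.MAX_VALUE;
            // Bulk score the required clause up to (but not including) the next excluded doc.
            r = scoreRange(required, r, upTo, hits);
            // If the required clause sits on the excluded doc, step over it.
            if (r < required.length && required[r] == upTo) {
                r++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] required = {1, 2, 3, 5, 8, 9};
        int[] excluded = {3, 4, 8};
        System.out.println(scoreExcluding(required, excluded)); // prints [1, 2, 5, 9]
    }
}
```

The exclusion check happens once per window rather than once per required document, which is why the gain should grow as the excluded clause gets sparser relative to the required one.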


