Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/12/18 12:04:46 UTC
[jira] [Updated] (LUCENE-6940) Bulk scoring could speed up MUST_NOT clauses
[ https://issues.apache.org/jira/browse/LUCENE-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-6940:
---------------------------------
Attachment: LUCENE-6940.patch
Here is a quick patch (disclaimer: neither commented nor tested) to demonstrate the idea. It uses the new bulk scorer in either of these cases:
- there is a single FILTER/MUST clause, no SHOULD clauses, and some MUST_NOT clauses
- there are some SHOULD clauses, no FILTER/MUST clauses, and some MUST_NOT clauses
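To make the two cases concrete, here is a minimal sketch of such an eligibility check. It is not taken from the patch: the Occur enum is a local stand-in that mirrors the names of Lucene's BooleanClause.Occur, and the class and method names are hypothetical.

```java
import java.util.List;

public class BulkScorerEligibilitySketch {

    // Local stand-in for Lucene's BooleanClause.Occur, redefined so the sketch is self-contained.
    enum Occur { MUST, FILTER, SHOULD, MUST_NOT }

    // Hypothetical helper: true when the clause mix matches one of the two cases listed above.
    static boolean usesExclusionBulkScorer(List<Occur> clauses) {
        long required = clauses.stream()
                .filter(o -> o == Occur.MUST || o == Occur.FILTER).count();
        long optional = clauses.stream().filter(o -> o == Occur.SHOULD).count();
        long prohibited = clauses.stream().filter(o -> o == Occur.MUST_NOT).count();

        if (prohibited == 0) {
            return false; // both cases need at least one MUST_NOT clause
        }
        boolean singleRequired = required == 1 && optional == 0; // first case
        boolean onlyOptional = required == 0 && optional > 0;    // second case
        return singleRequired || onlyOptional;
    }

    public static void main(String[] args) {
        System.out.println(usesExclusionBulkScorer(List.of(Occur.MUST, Occur.MUST_NOT)));
        System.out.println(usesExclusionBulkScorer(List.of(Occur.MUST, Occur.SHOULD, Occur.MUST_NOT)));
    }
}
```

With these assumptions, a query such as "+a -b" or "a b -c" qualifies, while a query mixing MUST and SHOULD clauses with a MUST_NOT clause does not.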
I added some tasks to wikimedium.10M.nostopwords.tasks and ran it through luceneutil. As expected, this seems to yield the largest speedup when the negative clauses match far fewer documents than the positive clauses.
{noformat}
diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks
index 342070c..8991121 100644
--- a/tasks/wikimedium.10M.nostopwords.tasks
+++ b/tasks/wikimedium.10M.nostopwords.tasks
@@ -13361,3 +13361,19 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
OrHighNotLow: do -necessities # freq=511178 freq=1195
OrNotHighLow: -had halfback # freq=1246743 freq=1205
OrHighNotLow: had -halfback # freq=1246743 freq=1205
+AllNotHigh: *:* -been # freq=1041183
+AllNotHigh: *:* -states # freq=1034872
+AllNotHigh: *:* -time # freq=1032071
+AllNotHigh: *:* -when # freq=1027487
+AllNotLow: *:* -factor # freq=37866
+AllNotLow: *:* -migration # freq=37862
+AllNotLow: *:* -maintained # freq=37840
+AllNotLow: *:* -norwegian # freq=37836
+OrHighHighNotLow: several following -factor # freq=436129 freq=416515 freq=37866
+OrHighHighNotLow: publisher end -migration # freq=1289029 freq=526636 freq=37862
+OrHighHighNotLow: 2009 film -maintained # freq=887702 freq=432758 freq=37840
+OrHighHighNotLow: http known -norwegian # freq=3493581 freq=607158 freq=37836
+OrHighLowNotHigh: 2005 jorgensen -been # freq=835460 freq=837 freq=1041183
+OrHighLowNotHigh: like undivided -states # freq=479390 freq=1512 freq=1034872
+OrHighLowNotHigh: use coy -time # freq=597053 freq=1198 freq=1032071
+OrHighLowNotHigh: been highperformanceengines -when # freq=1041183 freq=1155 freq=1027487
{noformat}
{noformat}
Task              QPS baseline   StdDev  QPS patch   StdDev  Pct diff
OrHighLowNotHigh         19.54   (2.6%)      18.59   (4.2%)     -4.9% ( -11% - 1%)
OrHighMed                34.32   (3.5%)      33.03   (4.8%)     -3.7% ( -11% - 4%)
OrHighHigh               26.95   (3.7%)      25.97   (4.9%)     -3.6% ( -11% - 5%)
Fuzzy2                   82.74  (16.0%)      80.29  (16.4%)     -3.0% ( -30% - 35%)
AndHighLow              502.91   (5.7%)     496.63   (3.0%)     -1.2% ( -9% - 7%)
AndHighMed              236.44   (2.9%)     234.34   (2.6%)     -0.9% ( -6% - 4%)
OrNotHighMed            222.75   (2.9%)     220.87   (2.4%)     -0.8% ( -5% - 4%)
Respell                  60.25   (3.0%)      60.38   (2.7%)      0.2% ( -5% - 6%)
MedSloppyPhrase          21.73   (2.3%)      21.92   (2.5%)      0.8% ( -3% - 5%)
Fuzzy1                   57.18   (8.0%)      57.78   (5.8%)      1.1% ( -11% - 16%)
LowSloppyPhrase          25.96   (1.9%)      26.24   (2.1%)      1.1% ( -2% - 5%)
HighSloppyPhrase         29.99   (2.5%)      30.37   (2.7%)      1.3% ( -3% - 6%)
MedPhrase                60.11   (2.8%)      61.15   (3.1%)      1.7% ( -4% - 7%)
AndHighHigh              32.86   (3.0%)      33.56   (3.0%)      2.1% ( -3% - 8%)
LowPhrase                59.36   (2.7%)      60.69   (3.2%)      2.2% ( -3% - 8%)
OrHighLow                78.50   (3.6%)      80.33   (4.3%)      2.3% ( -5% - 10%)
HighPhrase               17.32   (2.1%)      17.73   (1.9%)      2.4% ( -1% - 6%)
LowSpanNear              34.90   (2.8%)      35.75   (2.4%)      2.4% ( -2% - 7%)
MedSpanNear              30.83   (2.9%)      31.59   (2.0%)      2.4% ( -2% - 7%)
OrNotHighLow            982.57   (4.2%)    1009.18   (2.8%)      2.7% ( -4% - 10%)
HighSpanNear             10.39   (3.8%)      10.76   (3.7%)      3.5% ( -3% - 11%)
Wildcard                 64.30   (4.2%)      67.27   (5.2%)      4.6% ( -4% - 14%)
HighTerm                110.90   (5.2%)     117.51   (6.7%)      6.0% ( -5% - 18%)
MedTerm                 155.42   (5.3%)     165.05   (6.9%)      6.2% ( -5% - 19%)
OrNotHighHigh            40.19   (1.9%)      42.69   (3.2%)      6.2% ( 1% - 11%)
Prefix3                  87.35   (6.2%)      93.98   (6.9%)      7.6% ( -5% - 22%)
LowTerm                 574.81   (9.0%)     625.04   (9.6%)      8.7% ( -9% - 30%)
IntNRQ                   11.95   (9.1%)      13.31  (11.8%)     11.4% ( -8% - 35%)
OrHighNotHigh            50.66   (2.0%)      56.55   (4.3%)     11.6% ( 5% - 18%)
OrHighHighNotLow         27.15   (3.3%)      33.91   (4.9%)     24.9% ( 16% - 34%)
OrHighNotMed             96.64   (2.7%)     130.20   (8.0%)     34.7% ( 23% - 46%)
OrHighNotLow             42.44   (4.0%)      62.60  (10.7%)     47.5% ( 31% - 64%)
AllNotHigh                6.51   (2.9%)      16.76  (26.8%)    157.4% ( 124% - 192%)
AllNotLow                 7.18   (3.0%)      21.93  (49.2%)    205.3% ( 148% - 265%)
{noformat}
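For readers who want to see the idea in code, here is a minimal, self-contained sketch. It is not the attached patch and uses no Lucene API: sorted int arrays stand in for postings lists, and the class and method names are hypothetical. The excluded clause is advanced first, and the required clause is then collected in bulk over each window of doc IDs that ends at the next excluded document. It also hints at why sparse negative clauses would benefit most: fewer excluded documents mean fewer, longer windows.

```java
import java.util.ArrayList;
import java.util.List;

public class ExclusionWindowSketch {

    // Splits [0, maxDoc) into half-open windows {min, max} that contain no excluded doc.
    // Assumes `excluded` is sorted ascending, has no duplicates, and every entry is < maxDoc.
    static List<int[]> windowsBetweenExclusions(int[] excluded, int maxDoc) {
        List<int[]> windows = new ArrayList<>();
        int start = 0;
        for (int doc : excluded) {
            if (doc > start) {
                windows.add(new int[] {start, doc});
            }
            start = doc + 1; // resume right after the excluded doc
        }
        if (start < maxDoc) {
            windows.add(new int[] {start, maxDoc});
        }
        return windows;
    }

    // Collects required docs window by window; inside a window no exclusion check is needed.
    // Assumes `required` is sorted ascending.
    static List<Integer> collect(int[] required, int[] excluded, int maxDoc) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        for (int[] window : windowsBetweenExclusions(excluded, maxDoc)) {
            while (i < required.length && required[i] < window[0]) {
                i++; // skip required docs that coincide with an excluded doc
            }
            while (i < required.length && required[i] < window[1]) {
                hits.add(required[i++]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] required = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int[] excluded = {3, 7};
        System.out.println(collect(required, excluded, 10)); // prints [0, 1, 2, 4, 5, 6, 8, 9]
    }
}
```

In this toy run the two excluded docs produce three windows, [0, 3), [4, 7) and [8, 10), and the inner loop never consults the excluded list. A real implementation would presumably hand each window to the required clause's own bulk scorer rather than iterate doc by doc.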
> Bulk scoring could speed up MUST_NOT clauses
> --------------------------------------------
>
> Key: LUCENE-6940
> URL: https://issues.apache.org/jira/browse/LUCENE-6940
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6940.patch
>
>
> Today when you have MUST_NOT clauses, the ReqExclScorer is used and needs to check the excluded clauses on every iteration. I suspect we could speed things up by having a BulkScorer that would advance the excluded clause first and then tell the required clause to bulk score up to the next excluded document.