Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/06/16 09:47:06 UTC
[jira] [Updated] (LUCENE-7339) Bring back RandomAccessFilterStrategy
[ https://issues.apache.org/jira/browse/LUCENE-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-7339:
---------------------------------
Attachment: LUCENE-7339.patch
Here is a patch: it looks at the iterators to intersect and applies those that are instances of BitSetIterator in a random-access fashion. There is one exception: if the iterator with the minimum cost is a BitSetIterator, it is still consumed through the regular nextDoc/advance DISI API. This keeps the optimization safe, since the iterator with the least cost still "leads" the iteration.
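To make the idea concrete, here is a minimal, Lucene-free sketch of such a conjunction. It is not the code from the attached patch: the class names, the Clause type and the advanceTo helper are hypothetical stand-ins, with java.util.BitSet playing the role of the bit set behind a BitSetIterator. The order in which the random-access and sequential clauses are checked is an arbitrary choice of the sketch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Simplified model of a conjunction that checks bit-set clauses by random access. */
final class RandomAccessConjunctionSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical stand-in for one clause of the conjunction. */
  static final class Clause {
    private final int[] docs; // sorted doc IDs, consumed sequentially
    final BitSet bits;        // non-null models "instanceof BitSetIterator"
    private int pos = -1;     // cursor into docs, -1 means "not started"

    Clause(boolean bitSetBacked, int... sortedDocs) {
      this.docs = sortedDocs;
      if (bitSetBacked) {
        bits = new BitSet();
        for (int d : sortedDocs) {
          bits.set(d);
        }
      } else {
        bits = null;
      }
    }

    long cost() { return docs.length; }

    int docID() {
      if (pos < 0) return -1;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int nextDoc() { pos++; return docID(); }

    /** Moves forward until the current doc is >= target (no-op if already there). */
    int advanceTo(int target) {
      int doc = docID();
      while (doc < target) {
        doc = nextDoc();
      }
      return doc;
    }
  }

  static List<Integer> intersect(List<Clause> clauses) {
    // The cheapest clause leads, even when it is bit-set backed.
    Clause lead = clauses.get(0);
    for (Clause c : clauses) {
      if (c.cost() < lead.cost()) lead = c;
    }
    // Every other bit-set backed clause is only probed with get(doc).
    List<Clause> sequential = new ArrayList<>();
    List<BitSet> randomAccess = new ArrayList<>();
    for (Clause c : clauses) {
      if (c == lead) continue;
      if (c.bits != null) randomAccess.add(c.bits); else sequential.add(c);
    }
    List<Integer> hits = new ArrayList<>();
    nextCandidate:
    for (int doc = lead.nextDoc(); doc != NO_MORE_DOCS; doc = lead.nextDoc()) {
      for (BitSet bits : randomAccess) {
        if (!bits.get(doc)) continue nextCandidate;
      }
      for (Clause c : sequential) {
        if (c.advanceTo(doc) != doc) continue nextCandidate;
      }
      hits.add(doc);
    }
    return hits;
  }
}
```

Because the lead is always the clause with the smallest cost, the number of candidates examined never exceeds what the regular conjunction would visit; the bit-set clauses merely replace an advance call with a single bit lookup.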
I also changed the query cache to use a FixedBitSet rather than a RoaringDocIdSet when the density is greater than 1% (which was the cut-off used by RandomAccessFilterStrategy). This should help both because FixedBitSet is faster on dense sets (1% is -2 on the x axis at http://people.apache.org/~jpountz/doc_id_sets6.html) and also because it will enable this random-access optimization more often.
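As an illustration of the 1% cut-off only, the decision could look like the sketch below. The class, enum and method names are hypothetical, and whether the patch compares against an exact cardinality or an estimated cost is not stated here, so matchingDocs is simply assumed to be known. A density of 1% corresponds to log10(0.01) = -2, which is why it sits at -2 on the x axis of the linked chart.

```java
/** Hypothetical sketch of a density-based choice between two doc ID set representations. */
final class CacheDensitySketch {

  enum Representation { DENSE_BIT_SET, SPARSE_ROARING }

  /**
   * Returns DENSE_BIT_SET when strictly more than 1% of the segment's docs match.
   * matchingDocs * 100 > maxDoc is the same test as matchingDocs / maxDoc > 0.01,
   * written with integer arithmetic to avoid floating-point rounding.
   */
  static Representation choose(long matchingDocs, int maxDoc) {
    return matchingDocs * 100 > maxDoc
        ? Representation.DENSE_BIT_SET
        : Representation.SPARSE_ROARING;
  }
}
```

For a segment of 1,000,000 docs, this sketch picks the dense representation from 10,001 matching docs onwards and the sparse one at exactly 10,000.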
I had to hack luceneutil a bit in order to generate conjunctions with an iterator over a BitSet, which I did by creating a conjunction over a term query and a numeric range. Here is the patch I applied to luceneutil:
{noformat}
diff --git a/src/main/perf/TaskParser.java b/src/main/perf/TaskParser.java
index 8397b3a..2365159 100644
--- a/src/main/perf/TaskParser.java
+++ b/src/main/perf/TaskParser.java
@@ -244,6 +244,29 @@ class TaskParser {
         query = IntPoint.newRangeQuery(nrqFieldName, start, end);
         sort = null;
         group = null;
+      } else if (text.startsWith("filtered_nrq//")) {
+        // field start end
+        final int spot3 = text.indexOf(' ');
+        if (spot3 == -1) {
+          throw new RuntimeException("failed to parse query=" + text);
+        }
+        final int spot4 = text.indexOf(' ', spot3+1);
+        if (spot4 == -1) {
+          throw new RuntimeException("failed to parse query=" + text);
+        }
+        final int spot5 = text.indexOf(' ', spot4+1);
+        if (spot5 == -1) {
+          throw new RuntimeException("failed to parse query=" + text);
+        }
+        final String nrqFieldName = text.substring("filtered_nrq//".length(), spot3);
+        final int start = Integer.parseInt(text.substring(1+spot3, spot4));
+        final int end = Integer.parseInt(text.substring(1+spot4, spot5));
+        query = new BooleanQuery.Builder()
+            .add(new TermQuery(new Term("body", text.substring(1+spot5))), Occur.MUST)
+            .add(IntPoint.newRangeQuery(nrqFieldName, start, end), Occur.FILTER)
+            .build();
+        sort = null;
+        group = null;
       } else if (text.startsWith("datetimesort//")) {
         throw new IllegalArgumentException("use lastmodndvsort instead");
       } else if (text.startsWith("titlesort//")) {
diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks
index 342070c..983361f 100644
--- a/tasks/wikimedium.10M.nostopwords.tasks
+++ b/tasks/wikimedium.10M.nostopwords.tasks
@@ -13361,3 +13361,20 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
OrHighNotLow: do -necessities # freq=511178 freq=1195
OrNotHighLow: -had halfback # freq=1246743 freq=1205
OrHighNotLow: had -halfback # freq=1246743 freq=1205
+FilteredIntNRQ: filtered_nrq//timesecnum 6207 55832 ref
+FilteredIntNRQ: filtered_nrq//timesecnum 53 85622 http
+FilteredIntNRQ: filtered_nrq//timesecnum 2669 66142 from
+FilteredIntNRQ: filtered_nrq//timesecnum 9936 85687 name
+FilteredIntNRQ: filtered_nrq//timesecnum 23189 61377 title
+FilteredIntNRQ: filtered_nrq//timesecnum 7624 69351 date
+FilteredIntNRQ: filtered_nrq//timesecnum 15733 85583 which
+FilteredIntNRQ: filtered_nrq//timesecnum 8791 69420 also
+FilteredIntNRQ: filtered_nrq//timesecnum 6125 46693 first
+FilteredIntNRQ: filtered_nrq//timesecnum 8006 80120 his
+FilteredIntNRQ: filtered_nrq//timesecnum 11514 45063 cite
+FilteredIntNRQ: filtered_nrq//timesecnum 6342 72089 he
+FilteredIntNRQ: filtered_nrq//timesecnum 10670 66864 new
+FilteredIntNRQ: filtered_nrq//timesecnum 2896 83864 1
+FilteredIntNRQ: filtered_nrq//timesecnum 8947 64612 s
+FilteredIntNRQ: filtered_nrq//timesecnum 2808 75217 2
+FilteredIntNRQ: filtered_nrq//timesecnum 388 84762 one
{noformat}
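The new task lines follow the pattern filtered_nrq//<field> <start> <end> <term>. For readers who want to see the parsing in isolation, here is a standalone sketch of it. It is a simplified variant rather than the luceneutil code: the class name and its fields are hypothetical, it uses String.split instead of the indexOf calls from the patch, and it throws IllegalArgumentException instead of RuntimeException.

```java
/** Hypothetical holder for the four parts of a filtered_nrq task line. */
final class FilteredNrqTaskSketch {

  static final String PREFIX = "filtered_nrq//";

  final String field; // numeric field for the range filter
  final int start;    // lower bound of the range
  final int end;      // upper bound of the range
  final String term;  // text of the term query (everything after the third space)

  private FilteredNrqTaskSketch(String field, int start, int end, String term) {
    this.field = field;
    this.start = start;
    this.end = end;
    this.term = term;
  }

  static FilteredNrqTaskSketch parse(String text) {
    if (!text.startsWith(PREFIX)) {
      throw new IllegalArgumentException("failed to parse query=" + text);
    }
    // A limit of 4 keeps the remainder of the line intact as the term.
    String[] parts = text.substring(PREFIX.length()).split(" ", 4);
    if (parts.length != 4) {
      throw new IllegalArgumentException("failed to parse query=" + text);
    }
    return new FilteredNrqTaskSketch(
        parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2]), parts[3]);
  }
}
```

In the patch, the term becomes a MUST clause on the "body" field and the range becomes a FILTER clause, so only the term contributes to scoring.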
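I got a ~6% speedup on the new FilteredIntNRQ task when testing it on wikimedium10m; the differences on the other tasks stay within their standard deviations: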
{noformat}
Task QPS baseline StdDev QPS patch StdDev Pct diff
LowTerm 686.83 (7.6%) 677.34 (8.5%) -1.4% ( -16% - 15%)
HighSloppyPhrase 19.29 (4.3%) 19.03 (4.5%) -1.3% ( -9% - 7%)
Prefix3 159.38 (5.0%) 157.87 (5.0%) -0.9% ( -10% - 9%)
OrNotHighMed 191.55 (2.2%) 189.74 (2.3%) -0.9% ( -5% - 3%)
Wildcard 99.07 (5.9%) 98.18 (5.9%) -0.9% ( -11% - 11%)
MedSpanNear 38.50 (3.1%) 38.15 (3.2%) -0.9% ( -7% - 5%)
MedPhrase 88.65 (2.6%) 87.88 (1.8%) -0.9% ( -5% - 3%)
LowSloppyPhrase 59.35 (2.8%) 58.95 (2.9%) -0.7% ( -6% - 5%)
OrHighNotHigh 49.32 (4.0%) 49.00 (4.2%) -0.7% ( -8% - 7%)
OrHighNotMed 86.14 (4.8%) 85.60 (5.4%) -0.6% ( -10% - 10%)
OrHighNotLow 50.89 (6.1%) 50.58 (5.7%) -0.6% ( -11% - 11%)
MedSloppyPhrase 76.49 (2.0%) 76.05 (1.9%) -0.6% ( -4% - 3%)
AndHighHigh 119.69 (1.2%) 119.07 (2.1%) -0.5% ( -3% - 2%)
OrNotHighLow 1060.23 (3.2%) 1055.14 (3.9%) -0.5% ( -7% - 6%)
LowSpanNear 229.73 (2.9%) 228.70 (3.0%) -0.4% ( -6% - 5%)
LowPhrase 151.17 (2.2%) 150.49 (3.0%) -0.4% ( -5% - 4%)
HighPhrase 43.96 (1.7%) 43.79 (2.3%) -0.4% ( -4% - 3%)
HighSpanNear 2.23 (7.1%) 2.22 (7.4%) -0.4% ( -13% - 15%)
IntNRQ 12.11 (8.9%) 12.07 (8.9%) -0.3% ( -16% - 19%)
OrNotHighHigh 67.08 (3.9%) 66.92 (3.9%) -0.2% ( -7% - 7%)
MedTerm 157.87 (5.6%) 157.54 (5.1%) -0.2% ( -10% - 11%)
HighTerm 76.31 (5.8%) 76.16 (5.5%) -0.2% ( -10% - 11%)
Respell 74.16 (2.3%) 74.01 (3.2%) -0.2% ( -5% - 5%)
AndHighMed 144.89 (1.8%) 144.62 (1.5%) -0.2% ( -3% - 3%)
OrHighHigh 31.12 (6.2%) 31.18 (5.7%) 0.2% ( -11% - 12%)
OrHighMed 38.73 (6.0%) 38.84 (5.4%) 0.3% ( -10% - 12%)
OrHighLow 102.78 (4.0%) 103.09 (4.0%) 0.3% ( -7% - 8%)
Fuzzy1 75.62 (16.4%) 76.64 (12.7%) 1.4% ( -23% - 36%)
AndHighLow 785.48 (4.2%) 796.18 (3.5%) 1.4% ( -6% - 9%)
Fuzzy2 45.30 (16.5%) 47.03 (19.4%) 3.8% ( -27% - 47%)
FilteredIntNRQ 10.57 (4.0%) 11.20 (4.4%) 5.9% ( -2% - 14%)
{noformat}
> Bring back RandomAccessFilterStrategy
> -------------------------------------
>
> Key: LUCENE-7339
> URL: https://issues.apache.org/jira/browse/LUCENE-7339
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7339.patch
>
>
> FilteredQuery had 3 ways of running conjunctions: leap-frog, query first and random-access filter. We still use leap-frog for conjunctions and we now have a better "query-first" strategy through two-phase iteration. However, we don't have any equivalent for the random-access filter strategy.