You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/06/16 09:47:06 UTC

[jira] [Updated] (LUCENE-7339) Bring back RandomAccessFilterStrategy

     [ https://issues.apache.org/jira/browse/LUCENE-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7339:
---------------------------------
    Attachment: LUCENE-7339.patch

Here is a patch: it looks at the iterators to intersect and applies all of them that are an instance of BitSetIterator in a random-access fashion. There is one exception: if the iterator that has the minimum cost in the list is a BitSetIterator, then this iterator will still be consumed using the regular nextDoc/advance DISI API. This makes the optimization safe since we still use the iterator with the least cost in order to "lead" the iteration.

I also changed the query cache to use a FixedBitSet rather than a RoaringDocIdSet when the density is greater than 1% (which was the cut-off used by RandomAccessFilterStrategy). This should help both because FixedBitSet is faster on dense sets (1% is -2 on the x axis at http://people.apache.org/~jpountz/doc_id_sets6.html) and also because it will enable this random-access optimization more often.

I had to hack a bit luceneutil in order to generate conjunctions with an iterator over a BitSet, which I did by creating a conjunction over a term query and a numeric range. Here is the patch I applied to luceneutil:
{noformat}
diff --git a/src/main/perf/TaskParser.java b/src/main/perf/TaskParser.java
index 8397b3a..2365159 100644
--- a/src/main/perf/TaskParser.java
+++ b/src/main/perf/TaskParser.java
@@ -244,6 +244,29 @@ class TaskParser {
         query = IntPoint.newRangeQuery(nrqFieldName, start, end);
         sort = null;
         group = null;
+      } else if (text.startsWith("filtered_nrq//")) {
+        // field start end
+        final int spot3 = text.indexOf(' ');
+        if (spot3 == -1) {
+          throw new RuntimeException("failed to parse query=" + text);
+        }
+        final int spot4 = text.indexOf(' ', spot3+1);
+        if (spot4 == -1) {
+          throw new RuntimeException("failed to parse query=" + text);
+        }
+        final int spot5 = text.indexOf(' ', spot4+1);
+        if (spot5 == -1) {
+          throw new RuntimeException("failed to parse query=" + text);
+        }
+        final String nrqFieldName = text.substring("filtered_nrq//".length(), spot3);
+        final int start = Integer.parseInt(text.substring(1+spot3, spot4));
+        final int end = Integer.parseInt(text.substring(1+spot4, spot5));
+        query = new BooleanQuery.Builder()
+            .add(new TermQuery(new Term("body", text.substring(1+spot5))), Occur.MUST)
+            .add(IntPoint.newRangeQuery(nrqFieldName, start, end), Occur.FILTER)
+            .build();
+        sort = null;
+        group = null;
       } else if (text.startsWith("datetimesort//")) {
         throw new IllegalArgumentException("use lastmodndvsort instead");
       } else if (text.startsWith("titlesort//")) {
diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks
index 342070c..983361f 100644
--- a/tasks/wikimedium.10M.nostopwords.tasks
+++ b/tasks/wikimedium.10M.nostopwords.tasks
@@ -13361,3 +13361,20 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
 OrHighNotLow: do -necessities # freq=511178 freq=1195
 OrNotHighLow: -had halfback # freq=1246743 freq=1205
 OrHighNotLow: had -halfback # freq=1246743 freq=1205
+FilteredIntNRQ: filtered_nrq//timesecnum 6207 55832 ref
+FilteredIntNRQ: filtered_nrq//timesecnum 53 85622 http
+FilteredIntNRQ: filtered_nrq//timesecnum 2669 66142 from
+FilteredIntNRQ: filtered_nrq//timesecnum 9936 85687 name
+FilteredIntNRQ: filtered_nrq//timesecnum 23189 61377 title
+FilteredIntNRQ: filtered_nrq//timesecnum 7624 69351 date
+FilteredIntNRQ: filtered_nrq//timesecnum 15733 85583 which
+FilteredIntNRQ: filtered_nrq//timesecnum 8791 69420 also
+FilteredIntNRQ: filtered_nrq//timesecnum 6125 46693 first
+FilteredIntNRQ: filtered_nrq//timesecnum 8006 80120 his
+FilteredIntNRQ: filtered_nrq//timesecnum 11514 45063 cite
+FilteredIntNRQ: filtered_nrq//timesecnum 6342 72089 he
+FilteredIntNRQ: filtered_nrq//timesecnum 10670 66864 new
+FilteredIntNRQ: filtered_nrq//timesecnum 2896 83864 1
+FilteredIntNRQ: filtered_nrq//timesecnum 8947 64612 s
+FilteredIntNRQ: filtered_nrq//timesecnum 2808 75217 2
+FilteredIntNRQ: filtered_nrq//timesecnum 388 84762 one
{noformat}

I got a ~6% speedup when testing it on wikimedium10m:
{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                 LowTerm      686.83      (7.6%)      677.34      (8.5%)   -1.4% ( -16% -   15%)
        HighSloppyPhrase       19.29      (4.3%)       19.03      (4.5%)   -1.3% (  -9% -    7%)
                 Prefix3      159.38      (5.0%)      157.87      (5.0%)   -0.9% ( -10% -    9%)
            OrNotHighMed      191.55      (2.2%)      189.74      (2.3%)   -0.9% (  -5% -    3%)
                Wildcard       99.07      (5.9%)       98.18      (5.9%)   -0.9% ( -11% -   11%)
             MedSpanNear       38.50      (3.1%)       38.15      (3.2%)   -0.9% (  -7% -    5%)
               MedPhrase       88.65      (2.6%)       87.88      (1.8%)   -0.9% (  -5% -    3%)
         LowSloppyPhrase       59.35      (2.8%)       58.95      (2.9%)   -0.7% (  -6% -    5%)
           OrHighNotHigh       49.32      (4.0%)       49.00      (4.2%)   -0.7% (  -8% -    7%)
            OrHighNotMed       86.14      (4.8%)       85.60      (5.4%)   -0.6% ( -10% -   10%)
            OrHighNotLow       50.89      (6.1%)       50.58      (5.7%)   -0.6% ( -11% -   11%)
         MedSloppyPhrase       76.49      (2.0%)       76.05      (1.9%)   -0.6% (  -4% -    3%)
             AndHighHigh      119.69      (1.2%)      119.07      (2.1%)   -0.5% (  -3% -    2%)
            OrNotHighLow     1060.23      (3.2%)     1055.14      (3.9%)   -0.5% (  -7% -    6%)
             LowSpanNear      229.73      (2.9%)      228.70      (3.0%)   -0.4% (  -6% -    5%)
               LowPhrase      151.17      (2.2%)      150.49      (3.0%)   -0.4% (  -5% -    4%)
              HighPhrase       43.96      (1.7%)       43.79      (2.3%)   -0.4% (  -4% -    3%)
            HighSpanNear        2.23      (7.1%)        2.22      (7.4%)   -0.4% ( -13% -   15%)
                  IntNRQ       12.11      (8.9%)       12.07      (8.9%)   -0.3% ( -16% -   19%)
           OrNotHighHigh       67.08      (3.9%)       66.92      (3.9%)   -0.2% (  -7% -    7%)
                 MedTerm      157.87      (5.6%)      157.54      (5.1%)   -0.2% ( -10% -   11%)
                HighTerm       76.31      (5.8%)       76.16      (5.5%)   -0.2% ( -10% -   11%)
                 Respell       74.16      (2.3%)       74.01      (3.2%)   -0.2% (  -5% -    5%)
              AndHighMed      144.89      (1.8%)      144.62      (1.5%)   -0.2% (  -3% -    3%)
              OrHighHigh       31.12      (6.2%)       31.18      (5.7%)    0.2% ( -11% -   12%)
               OrHighMed       38.73      (6.0%)       38.84      (5.4%)    0.3% ( -10% -   12%)
               OrHighLow      102.78      (4.0%)      103.09      (4.0%)    0.3% (  -7% -    8%)
                  Fuzzy1       75.62     (16.4%)       76.64     (12.7%)    1.4% ( -23% -   36%)
              AndHighLow      785.48      (4.2%)      796.18      (3.5%)    1.4% (  -6% -    9%)
                  Fuzzy2       45.30     (16.5%)       47.03     (19.4%)    3.8% ( -27% -   47%)
          FilteredIntNRQ       10.57      (4.0%)       11.20      (4.4%)    5.9% (  -2% -   14%)
{noformat}

> Bring back RandomAccessFilterStrategy
> -------------------------------------
>
>                 Key: LUCENE-7339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7339
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7339.patch
>
>
> FiteredQuery had 3 ways of running conjunctions: leap-frog, query first and random-access filter. We still use leap-frog for conjunctions and we now have a better "query-first" strategy through two-phase iteration. However, we don't have any equivalent for the random-access filter strategy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org