Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/02/13 10:37:14 UTC

[jira] [Comment Edited] (LUCENE-6198) two phase intersection

    [ https://issues.apache.org/jira/browse/LUCENE-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319789#comment-14319789 ] 

Adrien Grand edited comment on LUCENE-6198 at 2/13/15 9:36 AM:
---------------------------------------------------------------

I did some more benchmarking, and one thing that helped was flattening clauses in ConjunctionDISI. This typically means that {{+"A B" +C}} is now approximated as {{+A +B +C}} instead of {{+(+A +B) +C}} (see attached patch).
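
To illustrate the flattening described above, here is a minimal sketch. It does not reproduce the attached patch or the real ConjunctionDISI code: the {{Clause}} type and the {{leaf}}, {{and}} and {{flatten}} helpers are hypothetical names invented for this example. It only shows how a nested conjunction such as {{+(+A +B) +C}} can be collapsed into a single flat list of clauses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenSketch {
    /** Hypothetical clause: a leaf term, or a conjunction of child clauses. */
    static final class Clause {
        final String term;           // set for a leaf, null otherwise
        final List<Clause> children; // set for a conjunction, null otherwise

        private Clause(String term, List<Clause> children) {
            this.term = term;
            this.children = children;
        }
    }

    static Clause leaf(String term) {
        return new Clause(term, null);
    }

    static Clause and(Clause... children) {
        return new Clause(null, Arrays.asList(children));
    }

    /** Collects the leaves of arbitrarily nested conjunctions into one flat list. */
    static List<String> flatten(Clause clause) {
        List<String> out = new ArrayList<>();
        collect(clause, out);
        return out;
    }

    private static void collect(Clause clause, List<String> out) {
        if (clause.children == null) {
            out.add(clause.term);
        } else {
            for (Clause child : clause.children) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        // +(+A +B) +C
        Clause nested = and(and(leaf("A"), leaf("B")), leaf("C"));
        System.out.println(flatten(nested)); // prints [A, B, C]
    }
}
```

With a flat list, a single intersection can advance all three iterators against each other instead of first intersecting A with B and then intersecting that result with C.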

Here are results on wikibig:

{noformat}
                    Task  QPS baseline      StdDev   QPS patch      StdDev                Pct diff
    AndMedPhraseHighTerm         21.19      (6.1%)       19.98      (2.6%)   -5.7% ( -13% -    3%)
                PKLookup        334.11      (2.1%)      334.82      (2.2%)    0.2% (  -4% -    4%)
   AndHighPhraseHighTerm         11.64      (4.1%)       11.83      (2.4%)    1.6% (  -4% -    8%)
    AndHighPhraseMedTerm         19.19      (2.5%)       21.99      (2.1%)   14.6% (   9% -   19%)
     AndMedPhraseMedTerm         58.27      (6.3%)       67.53      (6.6%)   15.9% (   2% -   30%)
    AndHighPhraseLowTerm         35.07      (5.6%)       42.46      (6.1%)   21.1% (   8% -   34%)
     AndMedPhraseLowTerm         93.39      (8.0%)      128.24     (13.3%)   37.3% (  14% -   63%)
{noformat}

I was curious about the slowdown on AndMedPhraseHighTerm. It seems to be tied to the fact that term occurrences are not random. For instance, one query of this task is {{+"los angeles" +title}}, which matches 30669 documents. However, the approximation {{+los +angeles +title}} matches 30711 documents, only 42 more (about 0.14%), so in this case the approximation rules out almost nothing and only adds overhead.


> two phase intersection
> ----------------------
>
>                 Key: LUCENE-6198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6198
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6198.patch, LUCENE-6198.patch, LUCENE-6198.patch, LUCENE-6198.patch, LUCENE-6198.patch, phrase_intersections.tasks
>
>
> Currently, some scorers have to do a lot of per-document work to determine whether a document is a match. The simplest example is a phrase scorer, but there are others (spans, sloppy phrase, geospatial, etc.).
> Imagine a conjunction with two MUST clauses: one is a term that matches all odd documents, the other is a phrase that matches all even documents. Today this conjunction is very expensive, because the zig-zag intersection reads a ton of useless positions.
> The same problem happens with FilteredQuery and anything else that acts like a conjunction.
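
The odd/even example in the description can be modeled with a minimal sketch. This is a simplified model under stated assumptions, not Lucene's scorer code: doc ids are plain sorted arrays, the {{confirm}} predicate stands in for the expensive position check, and the names {{TwoPhaseSketch}}, {{eager}} and {{twoPhase}} are hypothetical. The {{eager}} method models a phrase clause that confirms every approximate match by itself, while {{twoPhase}} intersects the cheap approximations first and confirms only the surviving candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

class TwoPhaseSketch {
    /** Intersects two sorted doc-id arrays, then runs the costly check only on shared ids. */
    static List<Integer> twoPhase(int[] cheap, int[] approximation, IntPredicate confirm) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < cheap.length && j < approximation.length) {
            if (cheap[i] < approximation[j]) {
                i++;
            } else if (cheap[i] > approximation[j]) {
                j++;
            } else {
                // Both clauses agree on this doc: only now pay for the confirmation.
                if (confirm.test(cheap[i])) {
                    hits.add(cheap[i]);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    /** Eager model: confirm every approximate match first, then intersect. */
    static List<Integer> eager(int[] cheap, int[] approximation, IntPredicate confirm) {
        int[] confirmed = Arrays.stream(approximation).filter(confirm).toArray();
        return twoPhase(cheap, confirmed, doc -> true);
    }

    public static void main(String[] args) {
        int[] termDocs = {1, 3, 5, 7, 9};     // term matches all odd documents
        int[] phraseApprox = {0, 2, 4, 6, 8}; // phrase terms co-occur in all even documents
        int[] checks = {0};                   // counts simulated position reads
        IntPredicate confirm = doc -> {
            checks[0]++;
            return true; // assume the phrase really matches wherever its terms co-occur
        };

        eager(termDocs, phraseApprox, confirm);
        System.out.println("eager position checks: " + checks[0]);

        checks[0] = 0;
        twoPhase(termDocs, phraseApprox, confirm);
        System.out.println("two-phase position checks: " + checks[0]);
    }
}
```

In this toy setup neither approach finds a hit, but the eager model performs one position check per even document, whereas the two-phase model performs none because the approximations never agree on a document.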


