Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/06/09 23:47:20 UTC

[jira] [Updated] (LUCENE-5049) Native (C++) implementation of "pure OR" BooleanQuery

     [ https://issues.apache.org/jira/browse/LUCENE-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5049:
---------------------------------------

    Attachment: LUCENE-5049.patch

Patch attached.

I use NativeMMapDir (borrowed from LUCENE-3178) so that each file can
be mapped as a single region, without chunking.

No core changes were needed (it uses reflection to grab all the stuff
it needs).

This works with the current default 4.x codec, but it's somewhat
sub-optimal on x86: Lucene stores longs big-endian and x86 is
little-endian, so I byte-swap every long at read time (though this is
a single instruction on x86).  Also, the longs are not aligned in
memory, but x86 seems not to pay much of a penalty for this: I tried a
modified postings format (PF) that inserts spacer bytes to align each
block, but net/net it was slower.
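
To make the byte-swap point concrete, here is a minimal Java sketch (not code from the patch, which does this in C++). It reads eight bytes in little-endian order, as an x86 load would, and then reverses them to recover the big-endian value that was stored. The class and method names are made up for illustration, and the read deliberately starts at an unaligned offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: class and method names are hypothetical, not from the patch.
class ByteSwapSketch {

    // Reads 8 bytes in little-endian order (what a plain x86 load would see),
    // then swaps them to recover the big-endian long that was written.
    static long readBigEndianLong(byte[] bytes, int offset) {
        long raw = ByteBuffer.wrap(bytes, offset, Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
        return Long.reverseBytes(raw);
    }

    public static void main(String[] args) {
        // One leading pad byte, so the long starts at an unaligned offset (1).
        // The next 8 bytes are the big-endian encoding of 0x0102 (decimal 258).
        byte[] stored = {(byte) 0xFF, 0, 0, 0, 0, 0, 0, 0x01, 0x02};
        System.out.println(readBigEndianLong(stored, 1));
    }
}
```

In Java, Long.reverseBytes is the portable way to express the swap; whether it compiles down to a single instruction depends on the JIT and the CPU.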

It's simple to use: there is a single static NativeSearch.search
method that uses the native code if it applies to the current
"context", and otherwise falls back to the normal IndexSearcher
method.  So an app can just route all searches through this API, and
those that can be optimized will be.
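
The "try the fast path, otherwise fall back" shape can be sketched roughly as below. This is only an illustration of the dispatch pattern, not the actual NativeSearch.search signature: the Query type, the tryNative helper, and the string results are all hypothetical stand-ins.

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: every name here is a hypothetical stand-in.
class FallbackSearchSketch {

    // Stand-in for a query; the real applicability check inspects far more.
    static final class Query {
        final boolean pureOrOfTerms;

        Query(boolean pureOrOfTerms) {
            this.pureOrOfTerms = pureOrOfTerms;
        }
    }

    // Hypothetical fast path: yields a result only when it can handle the query.
    static Optional<String> tryNative(Query q) {
        return q.pureOrOfTerms ? Optional.of("native") : Optional.empty();
    }

    // Single entry point: use the fast path when it applies, else the general one.
    static String search(Query q, Function<Query, String> general) {
        return tryNative(q).orElseGet(() -> general.apply(q));
    }

    public static void main(String[] args) {
        Function<Query, String> general = q -> "fallback";
        System.out.println(search(new Query(true), general));   // takes the fast path
        System.out.println(search(new Query(false), general));  // takes the general path
    }
}
```

The caller never has to know which path ran, which is what lets an app send every search through one method.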

There are definite limitations:

  * You must use NativeMMapDir, default codec and sim, all terms must
    be in one field, and you must sort by score.

  * It's C code ... so when there are bugs, or when you close the
    IndexSearcher while threads are still searching, etc., you'll get
    a SEGV and the OS will kill the JVM!

  * Only works on Linux (Unix) with little-endian CPUs, since I always
    byte swap (only tested on Linux/Intel).
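
An app could guard against the platform limitation before trying the native path. The sketch below is an assumption about how such a check might look, not something the patch provides: it conservatively accepts only Linux (the one tested platform) on a little-endian CPU, and the class and method names are hypothetical.

```java
import java.nio.ByteOrder;
import java.util.Locale;

// Illustrative only: this check is not part of the patch.
class PlatformCheckSketch {

    // Conservative assumption: accept only Linux on a little-endian CPU.
    static boolean nativePathSupported(String osName, ByteOrder order) {
        return osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("linux")
                && order == ByteOrder.LITTLE_ENDIAN;
    }

    public static void main(String[] args) {
        boolean ok = nativePathSupported(
                System.getProperty("os.name"), ByteOrder.nativeOrder());
        System.out.println("native path supported: " + ok);
    }
}
```

The other limitations (directory, codec, similarity, single field, sort by score) depend on the index and the query, so a platform check like this covers only one of them.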

                
> Native (C++) implementation of "pure OR" BooleanQuery
> -----------------------------------------------------
>
>                 Key: LUCENE-5049
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5049
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5049.patch
>
>
> I've been playing with a C++ implementation of BooleanQuery containing
> only OR'd (SHOULD) TermQuery clauses, collecting top N hits by score.
> The results are impressive: ~3X speedup for BQ OR over two terms, and
> good speedups (~38-78%) for Fuzzy1/2 as well, since they rewrite
> to BQ OR over N terms:
> {noformat}
>                     Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>                  MedTerm       69.47     (15.8%)       68.61     (13.4%)   -1.2% ( -26% -   33%)
>                 HighTerm       55.25     (16.2%)       54.63     (13.9%)   -1.1% ( -26% -   34%)
>                  LowTerm      333.10      (9.6%)      329.43      (8.0%)   -1.1% ( -17% -   18%)
>                   IntNRQ        3.37      (2.6%)        3.36      (4.6%)   -0.2% (  -7% -    7%)
>                  Prefix3       18.91      (2.0%)       19.04      (3.5%)    0.7% (  -4% -    6%)
>                 Wildcard       29.40      (1.7%)       29.70      (2.8%)    1.0% (  -3% -    5%)
>                MedPhrase      132.69      (6.2%)      134.66      (7.0%)    1.5% ( -11% -   15%)
>         HighSloppyPhrase        0.82      (3.6%)        0.83      (3.5%)    1.9% (  -5% -    9%)
>              AndHighHigh       19.65      (0.6%)       20.02      (0.8%)    1.9% (   0% -    3%)
>               HighPhrase       11.74      (6.6%)       11.96      (7.1%)    1.9% ( -11% -   16%)
>          MedSloppyPhrase       29.09      (1.2%)       29.76      (1.9%)    2.3% (   0% -    5%)
>          LowSloppyPhrase       25.71      (1.4%)       26.98      (1.7%)    4.9% (   1% -    8%)
>                  Respell      173.78      (3.0%)      182.41      (3.7%)    5.0% (  -1% -   12%)
>              MedSpanNear       27.67      (2.5%)       29.07      (2.4%)    5.1% (   0% -   10%)
>             HighSpanNear        2.95      (2.4%)        3.10      (2.8%)    5.4% (   0% -   10%)
>              LowSpanNear        8.29      (3.4%)        8.82      (3.3%)    6.4% (   0% -   13%)
>               AndHighMed       79.32      (1.6%)       84.44      (1.0%)    6.5% (   3% -    9%)
>                LowPhrase       23.20      (2.0%)       25.14      (1.6%)    8.4% (   4% -   12%)
>               AndHighLow      594.17      (3.4%)      660.32      (1.9%)   11.1% (   5% -   16%)
>                   Fuzzy2       88.32      (6.4%)      121.44      (1.7%)   37.5% (  27% -   48%)
>                   Fuzzy1       86.34      (6.0%)      153.49      (1.7%)   77.8% (  66% -   90%)
>               OrHighHigh       16.29      (2.5%)       48.29      (1.3%)  196.5% ( 188% -  205%)
>                OrHighMed       28.98      (2.7%)       87.81      (0.9%)  203.0% ( 194% -  212%)
>                OrHighLow       27.38      (2.6%)       84.94      (1.1%)  210.3% ( 201% -  219%)
> {noformat}
> This is essentially a scaled-back attempt at LUCENE-1594, in that it's
> "hardwired" to "just" the "OR of TermQuery" case.
