Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/06/09 23:47:20 UTC
[jira] [Updated] (LUCENE-5049) Native (C++) implementation of "pure OR" BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5049:
---------------------------------------
Attachment: LUCENE-5049.patch
Patch.
I use NativeMMapDir (borrowed from LUCENE-3178) to be able to map a
single file without chunking.
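
For context, Java's mapped byte buffers are indexed by int, which is why a pure-Java mmap directory has to split large files into chunks. Native code has no such limit. The following is a minimal POSIX sketch of mapping a whole file in one call; it is not the NativeMMapDir code from LUCENE-3178, and the names MappedFile, mapWholeFile and unmapFile are hypothetical.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical holder for a read-only mapping of one entire file.
struct MappedFile {
    const unsigned char* data;
    size_t length;
};

// Maps the whole file in a single mmap call, however large it is.
// Returns {nullptr, 0} if the file cannot be opened, is empty, or
// the mapping fails.
inline MappedFile mapWholeFile(const char* path) {
    MappedFile result = {nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return result;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, static_cast<size_t>(st.st_size),
                       PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            result.data = static_cast<const unsigned char*>(p);
            result.length = static_cast<size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return result;
}

// Releases the mapping; any later read through f.data would fault.
inline void unmapFile(MappedFile& f) {
    if (f.data != nullptr) {
        munmap(const_cast<unsigned char*>(f.data), f.length);
        f.data = nullptr;
        f.length = 0;
    }
}
```

Unmapping while another thread still reads through the pointer is exactly the kind of mistake that ends in a SEGV instead of a Java exception.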
No core changes were needed (it uses reflection to get at all the
internals it needs).
This works with the current default 4.x codec, but it's somewhat
sub-optimal on x86: Lucene stores longs big-endian while x86 is
little-endian, so I byte-swap every long at read time (though this
is a single instruction on x86). Also, the longs are not aligned in
memory, but x86 seems not to have much penalty for this (I tried a
modified postings format (PF) that inserts spacer bytes to align each
block, but on net it was slower).
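
To illustrate the read-time byte swap, here is a minimal sketch, not the code from the patch. The helper name readLongBE is hypothetical, and it assumes a little-endian host and a GCC or Clang compiler (for __builtin_bswap64).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads one big-endian 64-bit value from an
// arbitrary, possibly unaligned, address in the mapped file.
// Assumes a little-endian host, so the swap is unconditional.
inline uint64_t readLongBE(const unsigned char* p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof v);   // well-defined even when p is unaligned
    return __builtin_bswap64(v);    // typically compiles to one bswap on x86-64
}
```

On x86-64 with optimization enabled, compilers generally turn the memcpy into a plain load, so the cost is one load plus the swap.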
It's simple to use: there is a single static NativeSearch.search
method that uses the native code if it can apply to the current
"context", and otherwise falls back to the normal IndexSearcher
method. So an app can just route all searches through this API, and
those that can be optimized will be.
There are definite limitations:
* You must use NativeMMapDir, the default codec and sim, all terms
must be in one field, and you must sort by score.
* It's C code, so when there are bugs, or when you close the
IndexSearcher while threads are still searching, etc., you'll get
a SEGV and the OS will kill the JVM!
* Only works on little-endian CPUs running Linux (Unix) (only tested
on Linux/Intel) since I always byte-swap.
> Native (C++) implementation of "pure OR" BooleanQuery
> -----------------------------------------------------
>
> Key: LUCENE-5049
> URL: https://issues.apache.org/jira/browse/LUCENE-5049
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5049.patch
>
>
> I've been playing with a C++ implementation of BooleanQuery containing
> only OR'd (SHOULD) TermQuery clauses, collecting top N hits by score.
> The results are impressive: ~3X speedup for BQ OR over two terms, and
> good speedups (~38-78%) for Fuzzy1/2 too, since they rewrite to BQ OR
> over N terms:
> {noformat}
> Task              QPS base   StdDev  QPS comp   StdDev  Pct diff
> MedTerm              69.47  (15.8%)     68.61  (13.4%)     -1.2%  ( -26% -  33%)
> HighTerm             55.25  (16.2%)     54.63  (13.9%)     -1.1%  ( -26% -  34%)
> LowTerm             333.10   (9.6%)    329.43   (8.0%)     -1.1%  ( -17% -  18%)
> IntNRQ                3.37   (2.6%)      3.36   (4.6%)     -0.2%  (  -7% -   7%)
> Prefix3              18.91   (2.0%)     19.04   (3.5%)      0.7%  (  -4% -   6%)
> Wildcard             29.40   (1.7%)     29.70   (2.8%)      1.0%  (  -3% -   5%)
> MedPhrase           132.69   (6.2%)    134.66   (7.0%)      1.5%  ( -11% -  15%)
> HighSloppyPhrase      0.82   (3.6%)      0.83   (3.5%)      1.9%  (  -5% -   9%)
> AndHighHigh          19.65   (0.6%)     20.02   (0.8%)      1.9%  (   0% -   3%)
> HighPhrase           11.74   (6.6%)     11.96   (7.1%)      1.9%  ( -11% -  16%)
> MedSloppyPhrase      29.09   (1.2%)     29.76   (1.9%)      2.3%  (   0% -   5%)
> LowSloppyPhrase      25.71   (1.4%)     26.98   (1.7%)      4.9%  (   1% -   8%)
> Respell             173.78   (3.0%)    182.41   (3.7%)      5.0%  (  -1% -  12%)
> MedSpanNear          27.67   (2.5%)     29.07   (2.4%)      5.1%  (   0% -  10%)
> HighSpanNear          2.95   (2.4%)      3.10   (2.8%)      5.4%  (   0% -  10%)
> LowSpanNear           8.29   (3.4%)      8.82   (3.3%)      6.4%  (   0% -  13%)
> AndHighMed           79.32   (1.6%)     84.44   (1.0%)      6.5%  (   3% -   9%)
> LowPhrase            23.20   (2.0%)     25.14   (1.6%)      8.4%  (   4% -  12%)
> AndHighLow          594.17   (3.4%)    660.32   (1.9%)     11.1%  (   5% -  16%)
> Fuzzy2               88.32   (6.4%)    121.44   (1.7%)     37.5%  (  27% -  48%)
> Fuzzy1               86.34   (6.0%)    153.49   (1.7%)     77.8%  (  66% -  90%)
> OrHighHigh           16.29   (2.5%)     48.29   (1.3%)    196.5%  ( 188% - 205%)
> OrHighMed            28.98   (2.7%)     87.81   (0.9%)    203.0%  ( 194% - 212%)
> OrHighLow            27.38   (2.6%)     84.94   (1.1%)    210.3%  ( 201% - 219%)
> {noformat}
> This is essentially a scaled-back attempt at LUCENE-1594 in that it's
> "hardwired" to "just" the "OR of TermQuery" case.
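
To make the "OR of TermQuery, top N by score" case concrete, here is a deliberately simplified sketch. It is not the algorithm in the patch: it assumes each term's postings are already decoded into (doc, score) pairs with positive per-term scores, accumulates them term by term into a dense array, and then selects the best N with a min-heap. The names Posting, Hit and searchOr are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical decoded posting: a doc ID plus that term's score for the doc.
struct Posting {
    int32_t doc;
    float score;
};

using Hit = std::pair<float, int32_t>;  // (summed score, doc ID)

// Scores a pure OR over the given terms and returns up to topN hits,
// best score first. Assumes all doc IDs are in [0, maxDoc).
std::vector<Hit> searchOr(const std::vector<std::vector<Posting>>& terms,
                          int32_t maxDoc, size_t topN) {
    std::vector<Hit> hits;
    if (topN == 0 || maxDoc <= 0) return hits;

    // Sum each document's score across every term that matches it.
    std::vector<float> acc(static_cast<size_t>(maxDoc), 0.0f);
    for (const auto& postings : terms) {
        for (const Posting& p : postings) {
            acc[static_cast<size_t>(p.doc)] += p.score;
        }
    }

    // Min-heap holding the best topN hits seen so far; top() is the weakest.
    std::priority_queue<Hit, std::vector<Hit>, std::greater<Hit>> heap;
    for (int32_t doc = 0; doc < maxDoc; ++doc) {
        float s = acc[static_cast<size_t>(doc)];
        if (s <= 0.0f) continue;  // no term matched (scores assumed positive)
        if (heap.size() < topN) {
            heap.push(Hit(s, doc));
        } else if (s > heap.top().first) {
            heap.pop();
            heap.push(Hit(s, doc));
        }
    }

    // Drain the heap (ascending) and reverse to get best-first order.
    while (!heap.empty()) {
        hits.push_back(heap.top());
        heap.pop();
    }
    std::reverse(hits.begin(), hits.end());
    return hits;
}
```

A real implementation would decode postings blocks straight from the mapped index and score in bounded windows of doc IDs instead of allocating one accumulator per document in the segment.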