Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/04/29 13:35:06 UTC
[jira] [Comment Edited] (LUCENE-6458) MultiTermQuery's FILTER
rewrite method should support skipping whenever possible
[ https://issues.apache.org/jira/browse/LUCENE-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519160#comment-14519160 ]
Adrien Grand edited comment on LUCENE-6458 at 4/29/15 11:34 AM:
----------------------------------------------------------------
Here is a patch. It is quite similar to the old "auto" rewrite, except that it rewrites per segment and consumes the filtered terms enum only once. Queries are executed as regular disjunctions when there are 50 matching terms or fewer.
{noformat}
Task QPS baseline StdDev QPS patch StdDev Pct diff
Wildcard 76.93 (1.7%) 66.55 (4.1%) -13.5% ( -18% - -7%)
Prefix3 99.69 (1.8%) 88.80 (2.3%) -10.9% ( -14% - -6%)
OrHighMed 76.77 (3.7%) 76.26 (3.7%) -0.7% ( -7% - 6%)
OrHighNotHigh 37.88 (1.7%) 37.73 (2.2%) -0.4% ( -4% - 3%)
MedTerm 306.74 (1.4%) 305.54 (1.5%) -0.4% ( -3% - 2%)
OrHighHigh 36.17 (4.5%) 36.05 (4.0%) -0.3% ( -8% - 8%)
HighTerm 120.67 (1.2%) 120.37 (1.7%) -0.2% ( -3% - 2%)
MedSloppyPhrase 36.30 (2.9%) 36.25 (2.8%) -0.1% ( -5% - 5%)
IntNRQ 8.64 (2.9%) 8.63 (2.6%) -0.1% ( -5% - 5%)
LowSpanNear 70.11 (1.8%) 70.13 (2.2%) 0.0% ( -3% - 4%)
HighSpanNear 17.55 (2.8%) 17.56 (3.0%) 0.1% ( -5% - 6%)
OrHighNotMed 81.45 (1.8%) 81.51 (2.2%) 0.1% ( -3% - 4%)
LowPhrase 14.47 (2.7%) 14.50 (3.0%) 0.2% ( -5% - 6%)
MedSpanNear 120.55 (2.2%) 120.86 (2.0%) 0.3% ( -3% - 4%)
AndHighHigh 58.08 (2.5%) 58.24 (2.6%) 0.3% ( -4% - 5%)
LowSloppyPhrase 62.42 (4.3%) 62.60 (4.4%) 0.3% ( -8% - 9%)
OrHighNotLow 76.06 (1.9%) 76.36 (2.4%) 0.4% ( -3% - 4%)
Respell 72.86 (3.9%) 73.17 (2.9%) 0.4% ( -6% - 7%)
OrNotHighHigh 50.07 (1.5%) 50.30 (1.2%) 0.5% ( -2% - 3%)
HighSloppyPhrase 24.92 (6.4%) 25.05 (6.5%) 0.5% ( -11% - 14%)
OrHighLow 68.75 (4.6%) 69.17 (4.1%) 0.6% ( -7% - 9%)
HighPhrase 20.89 (2.5%) 21.04 (1.8%) 0.7% ( -3% - 5%)
OrNotHighMed 179.02 (1.9%) 180.37 (1.4%) 0.8% ( -2% - 4%)
PKLookup 263.21 (2.8%) 265.42 (3.0%) 0.8% ( -4% - 6%)
MedPhrase 34.60 (3.6%) 34.94 (3.4%) 1.0% ( -5% - 8%)
LowTerm 780.71 (3.2%) 790.04 (4.2%) 1.2% ( -5% - 8%)
OrNotHighLow 1459.46 (3.5%) 1480.76 (5.0%) 1.5% ( -6% - 10%)
AndHighMed 255.15 (2.6%) 258.93 (2.4%) 1.5% ( -3% - 6%)
Fuzzy1 77.69 (8.9%) 79.12 (7.7%) 1.8% ( -13% - 20%)
AndHighLow 961.32 (3.9%) 980.23 (3.5%) 2.0% ( -5% - 9%)
Fuzzy2 24.48 (7.9%) 25.19 (7.4%) 2.9% ( -11% - 19%)
{noformat}
I was hoping it would kick in for numeric range queries but unfortunately they often need to match hundreds of terms. I'm wondering if it would be different for auto-prefix.
Prefix3 and Wildcard are a bit slower because they are now actually executed as regular disjunctions. I think the slowdown is fair given that this also requires less memory and provides true skipping support (which the benchmark doesn't use).
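To make the per-segment decision described above concrete, here is a minimal sketch in plain Java. It is not the code from the attached patch and does not use Lucene's classes: `RewriteSketch`, `SegmentPlan`, `rewriteSegment`, and the `Map`-based postings are hypothetical stand-ins, and a `java.util.Iterator` plays the role of the filtered terms enum. The only detail taken from the comment is the threshold of 50 terms and the goal of walking the matching terms once.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class RewriteSketch {

    // Threshold from the comment: 50 matching terms or fewer run as a disjunction.
    static final int MAX_DISJUNCTION_TERMS = 50;

    /** Hypothetical per-segment plan: exactly one of the two fields is non-null. */
    static final class SegmentPlan {
        final List<String> disjunctionTerms; // set when executed as a disjunction
        final BitSet matchingDocs;           // set when executed as a bit set

        SegmentPlan(List<String> disjunctionTerms, BitSet matchingDocs) {
            this.disjunctionTerms = disjunctionTerms;
            this.matchingDocs = matchingDocs;
        }
    }

    /**
     * Decides how to execute one segment while iterating the matching terms
     * a single time. The postings map (term -> doc ids) is an assumption made
     * for illustration only.
     */
    static SegmentPlan rewriteSegment(Iterator<String> matchingTerms,
                                      Map<String, int[]> postings,
                                      int maxDoc) {
        // Buffer terms until the threshold is reached or the iterator runs out.
        List<String> buffered = new ArrayList<>();
        while (matchingTerms.hasNext() && buffered.size() < MAX_DISJUNCTION_TERMS) {
            buffered.add(matchingTerms.next());
        }
        if (!matchingTerms.hasNext()) {
            // Few terms: keep them as clauses so the scorer can skip.
            return new SegmentPlan(buffered, null);
        }
        // Too many terms: fall back to a bit set, starting with the buffered
        // terms and then continuing with the same iterator (no second pass).
        BitSet bits = new BitSet(maxDoc);
        for (String term : buffered) {
            addPostings(bits, postings.get(term));
        }
        while (matchingTerms.hasNext()) {
            addPostings(bits, postings.get(matchingTerms.next()));
        }
        return new SegmentPlan(null, bits);
    }

    private static void addPostings(BitSet bits, int[] docs) {
        if (docs == null) {
            return; // term has no postings in this segment
        }
        for (int doc : docs) {
            bits.set(doc);
        }
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            terms.add("term" + i);
            postings.put("term" + i, new int[] { i });
        }
        SegmentPlan few = rewriteSegment(terms.subList(0, 10).iterator(), postings, 256);
        SegmentPlan many = rewriteSegment(terms.iterator(), postings, 256);
        System.out.println("10 terms -> disjunction? " + (few.disjunctionTerms != null));
        System.out.println("200 terms -> bit set? " + (many.matchingDocs != null));
    }
}
```

In this sketch the buffered terms are reused when the threshold is exceeded, which is one way to avoid iterating the matching terms a second time. A segment that needs hundreds of terms, as numeric range queries often do, always ends up on the bit set path.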
was (Author: jpountz):
Here is a patch. It is quite similar to the old "auto" rewrite, except that it rewrites per segment and consumes the filtered terms enum only once. Queries are executed as regular disjunctions when there are 50 matching terms or fewer.
{noformat}
Task QPS baseline StdDev QPS patch StdDev Pct diff
Prefix3 113.17 (1.7%) 88.55 (2.7%) -21.8% ( -25% - -17%)
Wildcard 37.43 (2.0%) 36.26 (3.2%) -3.1% ( -8% - 2%)
HighSpanNear 4.30 (2.6%) 4.24 (4.0%) -1.6% ( -7% - 5%)
OrHighNotLow 71.52 (1.5%) 70.51 (3.1%) -1.4% ( -5% - 3%)
HighSloppyPhrase 20.60 (6.3%) 20.34 (7.6%) -1.3% ( -14% - 13%)
OrHighNotMed 96.14 (2.0%) 95.11 (2.8%) -1.1% ( -5% - 3%)
MedPhrase 23.49 (1.8%) 23.30 (3.5%) -0.8% ( -6% - 4%)
Respell 62.25 (8.9%) 62.01 (7.4%) -0.4% ( -15% - 17%)
AndHighHigh 52.43 (0.7%) 52.27 (1.1%) -0.3% ( -2% - 1%)
OrNotHighHigh 26.08 (3.5%) 26.02 (1.0%) -0.2% ( -4% - 4%)
OrHighNotHigh 61.96 (2.0%) 61.85 (2.1%) -0.2% ( -4% - 4%)
IntNRQ 8.03 (3.1%) 8.02 (2.6%) -0.2% ( -5% - 5%)
LowTerm 783.62 (4.9%) 783.25 (4.5%) -0.0% ( -9% - 9%)
MedSpanNear 18.77 (1.9%) 18.76 (3.6%) -0.0% ( -5% - 5%)
LowSpanNear 14.49 (2.5%) 14.49 (2.6%) -0.0% ( -4% - 5%)
MedTerm 237.81 (2.1%) 237.76 (3.0%) -0.0% ( -4% - 5%)
PKLookup 266.15 (2.5%) 266.38 (2.5%) 0.1% ( -4% - 5%)
OrHighMed 50.61 (6.0%) 50.68 (6.1%) 0.1% ( -11% - 13%)
Fuzzy2 19.87 (4.4%) 19.92 (7.8%) 0.2% ( -11% - 12%)
OrNotHighMed 90.03 (1.1%) 90.25 (0.8%) 0.2% ( -1% - 2%)
HighPhrase 15.56 (2.0%) 15.61 (2.7%) 0.3% ( -4% - 5%)
MedSloppyPhrase 252.97 (5.2%) 253.93 (4.3%) 0.4% ( -8% - 10%)
LowPhrase 8.16 (1.7%) 8.21 (1.9%) 0.6% ( -2% - 4%)
HighTerm 115.17 (2.4%) 116.05 (2.7%) 0.8% ( -4% - 6%)
OrHighHigh 25.19 (5.7%) 25.45 (6.4%) 1.0% ( -10% - 13%)
OrHighLow 42.12 (7.5%) 42.60 (6.9%) 1.1% ( -12% - 16%)
LowSloppyPhrase 129.20 (1.6%) 130.68 (2.0%) 1.2% ( -2% - 4%)
AndHighMed 231.64 (1.3%) 235.28 (2.1%) 1.6% ( -1% - 4%)
AndHighLow 733.51 (3.9%) 751.08 (3.5%) 2.4% ( -4% - 10%)
Fuzzy1 85.42 (17.0%) 91.04 (5.9%) 6.6% ( -13% - 35%)
OrNotHighLow 893.55 (2.9%) 962.35 (4.6%) 7.7% ( 0% - 15%)
{noformat}
I was hoping it would kick in for numeric range queries but unfortunately they often need to match hundreds of terms. I'm wondering if it would be different for auto-prefix.
Prefix3 and Wildcard are a bit slower because they are now actually executed as regular disjunctions. I think the slowdown is fair given that this also requires less memory and provides true skipping support (which the benchmark doesn't use).
> MultiTermQuery's FILTER rewrite method should support skipping whenever possible
> --------------------------------------------------------------------------------
>
> Key: LUCENE-6458
> URL: https://issues.apache.org/jira/browse/LUCENE-6458
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6458.patch
>
>
> Today MultiTermQuery's FILTER rewrite always builds a bit set from all matching terms. This means that we need to consume the entire postings lists of all matching terms. Instead, we should try to execute such queries like regular disjunctions when there are few terms.