You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/02/19 10:56:11 UTC
[jira] [Updated] (LUCENE-6260) Simplify ExactPhraseScorer
[ https://issues.apache.org/jira/browse/LUCENE-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-6260:
---------------------------------
Attachment: LUCENE-6260.patch
Here is a patch which makes phrase intersection essentially look like ConjunctionDISI except that it works on positions instead of doc IDs. I ran luceneutil on wikibig1M and the performance loss looks quite small:
{noformat}
TaskQPS baseline StdDev QPS patch StdDev Pct diff
HighPhrase 33.54 (1.3%) 31.72 (1.9%) -5.4% ( -8% - -2%)
LowPhrase 48.76 (1.2%) 47.74 (2.1%) -2.1% ( -5% - 1%)
OrNotHighLow 1167.83 (4.0%) 1153.63 (4.4%) -1.2% ( -9% - 7%)
Fuzzy1 112.76 (12.5%) 111.41 (11.9%) -1.2% ( -22% - 26%)
MedPhrase 126.21 (1.6%) 124.89 (2.8%) -1.0% ( -5% - 3%)
LowTerm 2361.80 (5.3%) 2338.19 (5.0%) -1.0% ( -10% - 9%)
AndHighLow 1053.44 (2.6%) 1043.11 (5.6%) -1.0% ( -8% - 7%)
OrHighNotMed 180.00 (1.8%) 179.10 (2.1%) -0.5% ( -4% - 3%)
OrHighNotLow 139.58 (2.6%) 139.24 (3.1%) -0.2% ( -5% - 5%)
IntNRQ 126.93 (6.3%) 126.72 (5.5%) -0.2% ( -11% - 12%)
AndHighHigh 130.72 (3.1%) 130.58 (3.2%) -0.1% ( -6% - 6%)
HighSpanNear 12.64 (1.2%) 12.63 (1.4%) -0.1% ( -2% - 2%)
Prefix3 92.94 (7.8%) 92.92 (7.6%) -0.0% ( -14% - 16%)
OrHighMed 155.49 (10.5%) 155.60 (10.0%) 0.1% ( -18% - 22%)
AndHighMed 181.53 (3.0%) 181.74 (3.0%) 0.1% ( -5% - 6%)
OrNotHighHigh 137.81 (3.1%) 137.98 (2.2%) 0.1% ( -5% - 5%)
OrHighHigh 136.52 (10.5%) 136.71 (9.8%) 0.1% ( -18% - 22%)
MedSloppyPhrase 44.59 (2.8%) 44.67 (3.3%) 0.2% ( -5% - 6%)
OrHighNotHigh 135.68 (1.6%) 135.93 (1.5%) 0.2% ( -2% - 3%)
MedTerm 949.94 (3.1%) 951.88 (2.9%) 0.2% ( -5% - 6%)
LowSpanNear 26.02 (0.9%) 26.07 (1.3%) 0.2% ( -1% - 2%)
OrHighLow 97.01 (11.1%) 97.22 (10.6%) 0.2% ( -19% - 24%)
MedSpanNear 27.98 (1.1%) 28.04 (1.0%) 0.2% ( -1% - 2%)
PKLookup 407.25 (2.2%) 408.17 (1.9%) 0.2% ( -3% - 4%)
OrNotHighMed 434.88 (2.8%) 435.99 (2.5%) 0.3% ( -4% - 5%)
Wildcard 166.20 (4.0%) 166.65 (4.6%) 0.3% ( -8% - 9%)
LowSloppyPhrase 107.31 (3.7%) 107.65 (3.9%) 0.3% ( -7% - 8%)
HighSloppyPhrase 13.76 (2.9%) 13.82 (2.9%) 0.4% ( -5% - 6%)
HighTerm 328.62 (2.3%) 330.24 (2.1%) 0.5% ( -3% - 4%)
Respell 67.48 (4.8%) 67.84 (5.5%) 0.5% ( -9% - 11%)
Fuzzy2 73.37 (15.2%) 78.35 (13.4%) 6.8% ( -18% - 41%)
{noformat}
One advantage of this approach is that it would help phraseFreq return earlier if scores are not needed and there is a match at the beginning of the document.
> Simplify ExactPhraseScorer
> --------------------------
>
> Key: LUCENE-6260
> URL: https://issues.apache.org/jira/browse/LUCENE-6260
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6260.patch
>
>
> ExactPhraseScorer tries to intersect positions using windows of 4096 documents. In LUCENE-2410 it was reported that it helped a lot but I tried again on wikibig with a simpler impl that does advance one position at a time and the performance difference was only of a few percents. I'm guessing that maybe other changes (eg. the new postings format?) do not make this behaviour as useful as it used to be?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org