Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2018/01/12 19:17:00 UTC
[jira] [Comment Edited] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324415#comment-16324415 ]
Adrien Grand edited comment on LUCENE-4198 at 1/12/18 7:16 PM:
---------------------------------------------------------------
To give some insight into future work on scorers, here is an untested patch (so far the only check is that luceneutil returns the same hits) that implements some ideas from the Block-Max WAND (BMW) paper.
The new {{BlockMaxConjunctionScorer}} skips blocks whose sum of max scores is less than the minimum competitive score. It also skips hits when the score of the clause with the highest max score is less than the minimum required score minus the sum of the max scores of the other clauses.
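The two skip conditions described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class and method names are hypothetical, scores are passed in as plain arrays, and the float rounding issues a real scorer has to handle are ignored.

```java
/** Hypothetical sketch of the two skip checks for a block-max conjunction. */
final class BlockMaxConjunctionSketch {

    private BlockMaxConjunctionSketch() {}

    /**
     * A block can be skipped when even the sum of the per-clause block max
     * scores cannot reach the minimum competitive score.
     */
    static boolean canSkipBlock(float[] blockMaxScores, float minCompetitiveScore) {
        double sum = 0;
        for (float max : blockMaxScores) {
            sum += max;
        }
        return sum < minCompetitiveScore;
    }

    /**
     * A hit can be skipped when the actual score of the lead clause (assumed
     * here to be the clause with the highest max score) is below the minimum
     * required score minus the max scores of all other clauses.
     */
    static boolean canSkipHit(float leadScore, float[] otherMaxScores, float minCompetitiveScore) {
        double sumOfOthers = 0;
        for (float max : otherMaxScores) {
            sumOfOthers += max;
        }
        return leadScore < minCompetitiveScore - sumOfOthers;
    }
}
```

For example, with block max scores of 1.0 and 2.0 and a minimum competitive score of 3.5, the whole block can be skipped, since no document in it can score above 3.0.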
{{WANDScorer}} uses the block max scores to compute an upper bound on the score of the current candidate, which already helps {{OrHighLow}}. It could also skip over whole blocks when the sum of the max scores is not competitive, but the implementation needs a bit more work than for conjunctions.
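As a rough illustration of the upper-bound idea for disjunctions, the sketch below sums the block max scores of the clauses that could still match the candidate document. It assumes that clauses already positioned past the candidate cannot match it. The class name, method names and array-based representation are hypothetical and are not taken from the patch.

```java
/** Hypothetical sketch of a block-max upper bound for a WAND-style disjunction. */
final class WandUpperBoundSketch {

    private WandUpperBoundSketch() {}

    /**
     * Upper bound on the score of candidateDoc. clauseDocs[i] is the document
     * clause i is currently positioned on, blockMaxScores[i] is the max score
     * of clause i in its current block. Clauses positioned beyond the
     * candidate are assumed not to match it and contribute nothing.
     */
    static double upperBound(int[] clauseDocs, float[] blockMaxScores, int candidateDoc) {
        double sum = 0;
        for (int i = 0; i < clauseDocs.length; i++) {
            if (clauseDocs[i] <= candidateDoc) {
                sum += blockMaxScores[i];
            }
        }
        return sum;
    }

    /** The candidate is only worth scoring if its upper bound is competitive. */
    static boolean isWorthScoring(int[] clauseDocs, float[] blockMaxScores,
                                  int candidateDoc, float minCompetitiveScore) {
        return upperBound(clauseDocs, blockMaxScores, candidateDoc) >= minCompetitiveScore;
    }
}
```

When the bound falls below the minimum competitive score, the candidate can be discarded without computing its actual score.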
Baseline is LUCENE-4198.patch, patch is LUCENE-4198.patch and LUCENE-4198-BMW.patch combined.
{noformat}
Task                  QPS baseline   StdDev  QPS patch   StdDev Pct diff
LowTerm                    2365.07   (2.8%)    2313.92   (2.5%)    -2.2%  ( -7% -   3%)
OrHighMed                    73.78   (2.9%)      72.70   (2.5%)    -1.5%  ( -6% -   4%)
HighTermDayOfYearSort        88.44  (11.4%)      87.15  (13.0%)    -1.5%  (-23% -  25%)
HighTerm                    650.28   (5.8%)     646.81   (5.7%)    -0.5%  (-11% -  11%)
Respell                     228.08   (2.5%)     227.84   (2.4%)    -0.1%  ( -4% -   4%)
MedTerm                    1189.63   (4.2%)    1189.27   (4.6%)    -0.0%  ( -8% -   9%)
MedSpanNear                  12.21   (5.0%)      12.24   (5.5%)     0.2%  ( -9% -  11%)
HighSpanNear                  7.26   (5.5%)       7.28   (5.8%)     0.2%  (-10% -  12%)
Wildcard                    108.43   (7.0%)     108.95   (6.8%)     0.5%  (-12% -  15%)
Prefix3                     128.80   (8.1%)     129.46   (7.8%)     0.5%  (-14% -  17%)
HighTermMonthSort           172.27   (8.0%)     173.28   (8.0%)     0.6%  (-14% -  18%)
Fuzzy2                      104.86   (5.7%)     105.79   (6.5%)     0.9%  (-10% -  13%)
LowSloppyPhrase              14.80   (5.6%)      14.93   (6.1%)     0.9%  (-10% -  13%)
LowSpanNear                  95.06   (3.4%)      96.07   (4.2%)     1.1%  ( -6% -   8%)
HighSloppyPhrase              3.96   (8.6%)       4.02   (9.7%)     1.6%  (-15% -  21%)
IntNRQ                       29.80   (7.0%)      30.50   (6.9%)     2.4%  (-10% -  17%)
Fuzzy1                      281.25   (4.8%)     288.77   (9.5%)     2.7%  (-11% -  17%)
MedSloppyPhrase              53.95   (8.0%)      55.43   (9.0%)     2.7%  (-13% -  21%)
OrHighHigh                   23.86   (4.1%)      24.70   (2.7%)     3.5%  ( -3% -  10%)
MedPhrase                    42.45   (2.2%)      44.10   (3.2%)     3.9%  ( -1% -   9%)
LowPhrase                    19.57   (2.7%)      20.47   (3.6%)     4.6%  ( -1% -  11%)
HighPhrase                   15.76   (4.1%)      16.91   (5.3%)     7.3%  ( -1% -  17%)
OrHighLow                   209.91   (2.3%)     261.10   (3.5%)    24.4%  ( 18% -  30%)
AndHighHigh                  27.22   (2.1%)      47.66   (5.1%)    75.1%  ( 66% -  84%)
AndHighLow                  514.84   (3.5%)     920.46   (6.0%)    78.8%  ( 66% -  91%)
AndHighMed                   56.15   (2.0%)     107.60   (5.4%)    91.6%  ( 82% - 101%)
{noformat}
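For readers checking the numbers, the "Pct diff" column appears to be the relative change in mean QPS from baseline to patch. The sketch below is an assumption derived from the rows of the table itself, not luceneutil code, and the class name is hypothetical. It does not attempt to reproduce the interval shown in parentheses.

```java
/** Hypothetical helper that recomputes the "Pct diff" column from the two QPS columns. */
final class PctDiffSketch {

    private PctDiffSketch() {}

    /** Relative change of the patch QPS over the baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double patchQps) {
        return 100.0 * (patchQps - baselineQps) / baselineQps;
    }
}
```

For instance, {{AndHighMed}} goes from 56.15 to 107.60 QPS, which is a change of (107.60 - 56.15) / 56.15, or about 91.6%, matching the table.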
> Allow codecs to index term impacts
> ----------------------------------
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Attachments: LUCENE-4198-BMW.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite' your index; he passes the IndexReader and Similarity to it. But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another problem is that it needs access to the term and collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment in a branch with these changes and see if we can make it work well.