Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2018/01/03 15:51:00 UTC
[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts
[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-4198:
---------------------------------
Attachment: LUCENE-4198.patch
I have been working on a prototype that adds skip data so that postings can know the best potential score for each block of documents. It would be nice not to make it Similarity-dependent, so that Similarities that use the same norm encoding can still be switched at search time like today. So the current approach is to store the maximum freq per block when norms are disabled, or all competitive (freq,norm) pairs when norms are enabled. This leverages the work that has been done on similarities to make sure that scores do not decrease when freq increases or when the norm decreases. This means that (freq,norm) is always at least as competitive as (freq-1,norm) or (freq,norm+1), so we don't need to store all (freq,norm) pairs, only the competitive ones. At search time, the sim scorer is passed to the postings producer so that it can compute the maximum score of a block by computing the score for all competitive {{(freq,norm)}} pairs.
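To make the notion of competitive pairs concrete, here is a minimal sketch of how a block's competitive (freq,norm) set could be accumulated. It is not taken from the patch: the class {{CompetitivePairs}} and its methods are hypothetical names, and it only assumes what is stated above, namely that a higher freq or an (unsigned) lower norm never produces a lower score.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical accumulator for the competitive (freq, norm) pairs of one block. */
final class CompetitivePairs {
    // Kept pairs, keyed by norm in unsigned order. Invariant: freq strictly
    // increases with norm, otherwise the higher-norm pair would be dominated.
    private final TreeMap<Long, Integer> freqByNorm =
        new TreeMap<Long, Integer>(Long::compareUnsigned);

    /** Records one document's (freq, norm), dropping any pair that becomes dominated. */
    void add(int freq, long norm) {
        // The entry with the greatest norm <= this norm has the highest freq
        // among all entries that could dominate the new pair.
        Map.Entry<Long, Integer> floor = freqByNorm.floorEntry(norm);
        if (floor != null && floor.getValue() >= freq) {
            return; // an existing pair scores at least as high
        }
        freqByNorm.put(norm, freq);
        // Pairs with a greater norm and a lower-or-equal freq are now dominated.
        Iterator<Map.Entry<Long, Integer>> it =
            freqByNorm.tailMap(norm, false).entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= freq) {
                it.remove();
            } else {
                break; // freqs only grow from here on, so nothing else is dominated
            }
        }
    }

    /** Returns the kept pairs as {freq, norm} arrays, in ascending unsigned norm order. */
    List<long[]> pairs() {
        List<long[]> out = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : freqByNorm.entrySet()) {
            out.add(new long[] {e.getValue(), e.getKey()});
        }
        return out;
    }
}
```

With this structure, adding (3,10), (2,10), (5,12) and then (4,8) leaves only (4,8) and (5,12): (2,10) is never stored, and (3,10) is evicted once (4,8) arrives.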
Note that the attached patch is a rough prototype: it is hacky and not everything compiles. I did just the bare minimum so that some basic tests and luceneutil can run. There is very little testing. Some notes about the approach:
- This patch adds the assumption that (unsigned) greater norms produce equal or lower scores. I liked this better than adding a new API on Similarity so that it could tell us how to compare norms.
- Skip lists store the competitive (freq,norm) pairs only on level 1 and greater, not on level 0, since on level 0 they could take more storage than the postings block itself.
- I had to add norms producers to the postings consumers so that they could know about norms.
- Having to pass the sim scorer to the postings producer is a bit ugly but I couldn't figure out a way to make it nicer.
- The similarity API doesn't make this easy to integrate: it currently gives a {{score(docID, freq)}} API while we'd rather need a {{score(freq,norm)}} API, especially because this optimization only works if freq and norm are the only per-document parameters that can influence the score.
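As an illustration of the last point, here is a rough sketch of what a {{score(freq,norm)}}-shaped scorer could look like and how a block's maximum score would be derived from its competitive pairs. Everything here is an assumption made for the example: {{FreqNormScorer}}, {{BlockMaxScore}} and {{Impact}} are hypothetical names, and the toy BM25-like function treats the norm directly as a small non-negative document length, which is not how real norms are encoded.

```java
import java.util.List;

/** Hypothetical scorer shape: the score depends only on freq and norm. */
interface FreqNormScorer {
    float score(int freq, long norm);
}

final class BlockMaxScore {
    /** Hypothetical holder for one competitive pair read from skip data. */
    static final class Impact {
        final int freq;
        final long norm;

        Impact(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /** Upper bound on the score of any document in the block. */
    static float maxScore(FreqNormScorer scorer, List<Impact> competitive) {
        float max = 0f;
        for (Impact impact : competitive) {
            max = Math.max(max, scorer.score(impact.freq, impact.norm));
        }
        return max;
    }

    /**
     * Toy BM25-like function for illustration only. Assumption: the norm is used
     * directly as the document length, so a greater norm gives a lower score.
     */
    static FreqNormScorer toyBm25(float k1, float b, float avgLength) {
        return (freq, norm) -> {
            float length = (float) norm;
            return freq * (k1 + 1) / (freq + k1 * (1 - b + b * length / avgLength));
        };
    }
}
```

Because the scorer only sees freq and norm, the same stored pairs stay valid for any scorer that respects the monotonicity assumption, which is what would allow switching Similarities at search time.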
Here is what it gives on luceneutil when disabling total hit counts on both master and the patch:
{noformat}
Task                  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndHighHigh                 127.39   (1.4%)     100.94   (2.4%)    -20.8%  ( -24% - -17%)
AndHighMed                  240.66   (2.0%)     212.11   (1.3%)    -11.9%  ( -14% - -8%)
OrHighMed                    76.60   (3.6%)      69.37   (2.3%)     -9.4%  ( -14% - -3%)
OrHighHigh                   27.37   (3.9%)      24.78   (2.4%)     -9.4%  ( -15% - -3%)
Fuzzy1                      328.61   (6.5%)     316.04   (5.4%)     -3.8%  ( -14% - 8%)
Wildcard                     56.88   (7.6%)      55.64  (10.0%)     -2.2%  ( -18% - 16%)
Fuzzy2                      144.68   (3.5%)     142.07   (5.8%)     -1.8%  ( -10% - 7%)
Prefix3                     372.69   (6.1%)     366.43   (7.7%)     -1.7%  ( -14% - 12%)
HighTermDayOfYearSort       132.88   (6.6%)     131.18   (7.7%)     -1.3%  ( -14% - 13%)
LowSpanNear                  53.14   (1.8%)      52.48   (1.9%)     -1.2%  ( -4% - 2%)
HighTermMonthSort           109.37   (7.8%)     108.12   (7.1%)     -1.1%  ( -14% - 14%)
LowSloppyPhrase              54.79   (1.2%)      54.20   (1.1%)     -1.1%  ( -3% - 1%)
Respell                     293.10   (2.9%)     290.77   (5.7%)     -0.8%  ( -9% - 8%)
HighSloppyPhrase             35.60   (1.6%)      35.33   (1.6%)     -0.8%  ( -3% - 2%)
OrNotHighLow               1686.91   (3.4%)    1675.46   (1.8%)     -0.7%  ( -5% - 4%)
HighPhrase                   24.98   (1.9%)      24.82   (1.7%)     -0.6%  ( -4% - 3%)
MedSpanNear                 228.02   (3.4%)     226.69   (3.6%)     -0.6%  ( -7% - 6%)
MedSloppyPhrase              46.13   (1.4%)      45.87   (1.3%)     -0.6%  ( -3% - 2%)
MedPhrase                   642.58   (3.7%)     639.51   (3.1%)     -0.5%  ( -6% - 6%)
LowPhrase                    82.99   (2.1%)      82.63   (1.6%)     -0.4%  ( -3% - 3%)
HighSpanNear                 34.77   (2.8%)      34.66   (3.1%)     -0.3%  ( -5% - 5%)
IntNRQ                       32.59  (15.2%)      32.61  (14.9%)      0.1%  ( -26% - 35%)
AndHighLow                 1719.37   (3.8%)    1915.66   (2.8%)     11.4%  ( 4% - 18%)
OrHighLow                  1290.65   (3.1%)    1808.66   (3.7%)     40.1%  ( 32% - 48%)
LowTerm                     873.82   (3.1%)    1527.34   (7.2%)     74.8%  ( 62% - 87%)
OrNotHighMed                285.74   (2.5%)     590.09   (3.9%)    106.5%  ( 97% - 115%)
MedTerm                     180.74   (3.6%)     970.40  (20.2%)    436.9%  ( 398% - 477%)
OrNotHighHigh                63.41   (0.8%)     529.76  (20.0%)    735.5%  ( 709% - 762%)
OrHighNotLow                 71.04   (0.6%)     649.36  (30.4%)    814.1%  ( 778% - 850%)
HighTerm                     85.02   (3.7%)     804.33  (35.6%)    846.1%  ( 778% - 919%)
OrHighNotMed                107.76   (0.6%)    1929.95  (48.1%)   1691.0%  (1633% - 1749%)
OrHighNotHigh                24.38   (0.5%)     478.53  (56.8%)   1862.7%  (1796% - 1928%)
{noformat}
It makes {{HighTerm}} about 8x faster. If you wonder why it also helps some boolean queries, this is because boolean queries propagate information about the minimum competitive score to their sub-clauses. The disk usage increase is negligible: only 0.5% on {{.doc}} files and 0.2% overall. However, I have not measured indexing speed.
If someone has ideas about what the API could look like, I'd be happy to discuss.
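For readers unfamiliar with how a per-block maximum score turns into a speedup, here is a simplified sketch of top-k collection that skips whole blocks once their best possible score can no longer beat the current minimum competitive score. It is not the patch's code: {{BlockSkippingTopK}} is a hypothetical name, blocks are modelled as plain arrays of already-computed scores, and k is assumed to be at least 1.

```java
import java.util.PriorityQueue;

final class BlockSkippingTopK {
    /**
     * Collects the top k scores and returns how many blocks actually had to be
     * scored. blockScores[b] holds the scores of the documents in block b and
     * blockMaxScores[b] is an upper bound on them, as skip data would provide.
     */
    static int countScoredBlocks(float[][] blockScores, float[] blockMaxScores, int k) {
        PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap of the best k scores
        int scoredBlocks = 0;
        for (int b = 0; b < blockScores.length; b++) {
            // Until k hits are collected, every document is competitive.
            float minCompetitive =
                topK.size() < k ? Float.NEGATIVE_INFINITY : topK.peek();
            if (blockMaxScores[b] <= minCompetitive) {
                continue; // no document in this block can enter the top k
            }
            scoredBlocks++;
            for (float score : blockScores[b]) {
                if (topK.size() < k) {
                    topK.add(score);
                } else if (score > topK.peek()) {
                    topK.poll();
                    topK.add(score);
                }
            }
        }
        return scoredBlocks;
    }
}
```

Documents in skipped blocks are never visited, so they cannot be counted, which is why the benchmark above was run with total hit counts disabled.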
> Allow codecs to index term impacts
> ----------------------------------
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Attachments: LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite' your index; he passes the IndexReader and Similarity to it. But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another problem is that it needs access to the term and collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment in a branch with these changes and see if we can make it work well.