Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2018/01/03 15:51:00 UTC

[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

     [ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
---------------------------------
    Attachment: LUCENE-4198.patch

I have been working on a prototype that adds skip data so that postings can know the best potential score for each block of documents. It would be nice not to make it Similarity-dependent, so that Similarities that use the same norm encoding can still be switched at search time like today. So the current approach is to store the maximum freq per block when norms are disabled, or all competitive {{(freq, norm)}} pairs when norms are enabled. This leverages the work that has been done on similarities to make sure that scores do not decrease when the freq increases, and do not increase when the norm increases. This means that {{(freq, norm)}} is always at least as competitive as {{(freq-1, norm)}} or {{(freq, norm+1)}}, so we don't need to store all {{(freq, norm)}} pairs, only the competitive ones. At search time, the sim scorer is passed to the postings producer so that it can compute the maximum score of a block by computing the score for all competitive {{(freq, norm)}} pairs.
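To illustrate the pruning idea (this is a standalone sketch, not code from the patch; the class and method names are made up): under the monotonicity assumption, a pair is droppable whenever another pair has a freq at least as large and a norm at least as small, because that other pair is guaranteed to score at least as high.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: keep only the "competitive" (freq, norm) pairs of a block,
// assuming scores never decrease as freq grows and never increase as
// the norm grows. A pair is dominated -- hence droppable -- if some
// other pair has freq >= its freq and norm <= its norm.
public class CompetitiveImpacts {
    public static final class Impact {
        public final int freq;
        public final long norm; // Lucene compares norms as unsigned; a plain long keeps the sketch simple
        public Impact(int freq, long norm) { this.freq = freq; this.norm = norm; }
    }

    private final List<Impact> impacts = new ArrayList<>();

    /** Record a (freq, norm) observation, keeping only non-dominated pairs. */
    public void add(int freq, long norm) {
        for (Impact i : impacts) {
            if (i.freq >= freq && i.norm <= norm) {
                return; // an existing pair already scores at least as high
            }
        }
        // drop existing pairs that the new pair dominates
        impacts.removeIf(i -> freq >= i.freq && norm <= i.norm);
        impacts.add(new Impact(freq, norm));
    }

    public List<Impact> getImpacts() { return impacts; }
}
```

For example, (2, 10) is dominated by (3, 10) and never stored, while (5, 5) dominates both (3, 10) and (5, 20) and replaces them.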

Note that the attached patch is a rough prototype: it is hacky and not everything compiles. I just did the bare minimum so that some basic tests and luceneutil can run, and there is very little testing. Some notes about the approach:
 - This patch adds the assumption that greater (unsigned) norms produce equal or lower scores. I liked this better than adding a new API to Similarity so that it could tell us how to compare norms.
 - Skip lists only store the competitive (freq, norm) pairs on levels 1 and greater, since storing them on level 0 could take more space than the postings block itself.
 - I had to add norms producers to the postings consumers so that they could know about norms.
 - Having to pass the sim scorer to the postings producer is a bit ugly, but I couldn't figure out a nicer way to do it.
 - The similarity API doesn't make this easy to integrate: it currently exposes a {{score(docID, freq)}} API, while we'd rather need a {{score(freq, norm)}} API, especially because this optimization only works if freq and norm are the only per-document parameters that can influence the score.
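The search-time side of the last point can be sketched as follows (hypothetical names; the {{SimScorer}} interface below is the {{score(freq, norm)}} shape this patch would want, not the current Similarity API): the upper bound for a block is simply the maximum score over its competitive pairs.

```java
import java.util.List;

// Sketch: compute a block's maximum possible score by scoring every
// competitive (freq, norm) pair with a hypothetical score(freq, norm)
// style scorer. Each impact is encoded here as {freq, norm}.
public class BlockMaxScore {
    /** Hypothetical per-(freq, norm) scorer; not the current Similarity API. */
    public interface SimScorer {
        float score(int freq, long norm);
    }

    public static float maxScore(List<long[]> impacts, SimScorer scorer) {
        float max = 0;
        for (long[] impact : impacts) { // impact[0] = freq, impact[1] = norm
            max = Math.max(max, scorer.score((int) impact[0], impact[1]));
        }
        return max;
    }
}
```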

Here is what it gives on luceneutil when disabling total hit counts on both master and the patch:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
             AndHighHigh      127.39      (1.4%)      100.94      (2.4%)  -20.8% ( -24% -  -17%)
              AndHighMed      240.66      (2.0%)      212.11      (1.3%)  -11.9% ( -14% -   -8%)
               OrHighMed       76.60      (3.6%)       69.37      (2.3%)   -9.4% ( -14% -   -3%)
              OrHighHigh       27.37      (3.9%)       24.78      (2.4%)   -9.4% ( -15% -   -3%)
                  Fuzzy1      328.61      (6.5%)      316.04      (5.4%)   -3.8% ( -14% -    8%)
                Wildcard       56.88      (7.6%)       55.64     (10.0%)   -2.2% ( -18% -   16%)
                  Fuzzy2      144.68      (3.5%)      142.07      (5.8%)   -1.8% ( -10% -    7%)
                 Prefix3      372.69      (6.1%)      366.43      (7.7%)   -1.7% ( -14% -   12%)
   HighTermDayOfYearSort      132.88      (6.6%)      131.18      (7.7%)   -1.3% ( -14% -   13%)
             LowSpanNear       53.14      (1.8%)       52.48      (1.9%)   -1.2% (  -4% -    2%)
       HighTermMonthSort      109.37      (7.8%)      108.12      (7.1%)   -1.1% ( -14% -   14%)
         LowSloppyPhrase       54.79      (1.2%)       54.20      (1.1%)   -1.1% (  -3% -    1%)
                 Respell      293.10      (2.9%)      290.77      (5.7%)   -0.8% (  -9% -    8%)
        HighSloppyPhrase       35.60      (1.6%)       35.33      (1.6%)   -0.8% (  -3% -    2%)
            OrNotHighLow     1686.91      (3.4%)     1675.46      (1.8%)   -0.7% (  -5% -    4%)
              HighPhrase       24.98      (1.9%)       24.82      (1.7%)   -0.6% (  -4% -    3%)
             MedSpanNear      228.02      (3.4%)      226.69      (3.6%)   -0.6% (  -7% -    6%)
         MedSloppyPhrase       46.13      (1.4%)       45.87      (1.3%)   -0.6% (  -3% -    2%)
               MedPhrase      642.58      (3.7%)      639.51      (3.1%)   -0.5% (  -6% -    6%)
               LowPhrase       82.99      (2.1%)       82.63      (1.6%)   -0.4% (  -3% -    3%)
            HighSpanNear       34.77      (2.8%)       34.66      (3.1%)   -0.3% (  -5% -    5%)
                  IntNRQ       32.59     (15.2%)       32.61     (14.9%)    0.1% ( -26% -   35%)
              AndHighLow     1719.37      (3.8%)     1915.66      (2.8%)   11.4% (   4% -   18%)
               OrHighLow     1290.65      (3.1%)     1808.66      (3.7%)   40.1% (  32% -   48%)
                 LowTerm      873.82      (3.1%)     1527.34      (7.2%)   74.8% (  62% -   87%)
            OrNotHighMed      285.74      (2.5%)      590.09      (3.9%)  106.5% (  97% -  115%)
                 MedTerm      180.74      (3.6%)      970.40     (20.2%)  436.9% ( 398% -  477%)
           OrNotHighHigh       63.41      (0.8%)      529.76     (20.0%)  735.5% ( 709% -  762%)
            OrHighNotLow       71.04      (0.6%)      649.36     (30.4%)  814.1% ( 778% -  850%)
                HighTerm       85.02      (3.7%)      804.33     (35.6%)  846.1% ( 778% -  919%)
            OrHighNotMed      107.76      (0.6%)     1929.95     (48.1%) 1691.0% (1633% - 1749%)
           OrHighNotHigh       24.38      (0.5%)      478.53     (56.8%) 1862.7% (1796% - 1928%)
{noformat}

It makes {{HighTerm}} about 8x faster. If you wonder why it also helps some boolean queries, it is because boolean queries propagate information about the minimum competitive score to their sub-clauses. The disk usage increase is negligible: only 0.5% on {{.doc}} files and 0.2% overall. I have not measured indexing speed, however.
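The mechanism behind these speedups can be sketched in a few lines (illustrative only; the block scores and threshold are made up): once a top-k collector is full, any block whose maximum possible score does not beat the current minimum competitive score can be skipped without scoring a single document in it.

```java
// Sketch: with per-block max scores available, only blocks whose upper
// bound exceeds the collector's minimum competitive score need
// per-document scoring; the rest are skipped wholesale.
public class BlockSkipping {
    public static int countBlocksToScore(float[] blockMaxScores, float minCompetitiveScore) {
        int scored = 0;
        for (float max : blockMaxScores) {
            if (max > minCompetitiveScore) {
                scored++; // only these blocks need per-document scoring
            }
        }
        return scored;
    }
}
```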

If someone has ideas what the API could look like, I'd be happy to discuss.

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite' your index; he passes the IndexReader and Similarity to it. But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another problem is that it needs access to the term and collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking of experimenting with these changes in a branch to see if we can make them work well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
