Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2019/07/02 12:57:00 UTC

[jira] [Reopened] (LUCENE-8069) Allow index sorting by field length

     [ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand reopened LUCENE-8069:
----------------------------------

This idea has come back to mind several times since I opened this issue. Sorting by norm brings the following benefits:
 - Better compression: documents with low doc IDs are likely to have small term frequencies, since the term frequency is usually less than or equal to the norm.
 - Smaller impacts: since each block of postings has only one unique norm value on average, it also has only one impact on average. This helps at search time, since computing the score of this impact immediately gives us the best score of the block, as opposed to having to iterate over several impacts and take the highest score.
 - For term queries, it ensures that among all documents that have X occurrences of the queried term, we visit the documents with the lowest norm first, and thus the ones that produce the best scores.
 - Boolean queries are interesting: they get the same benefit as term queries, but on the other hand the norm tends to correlate with the number of unique terms, so you might need to collect more matches before you find one that matches several query terms.
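
To make the impacts argument concrete, here is a minimal sketch in Python. It is not Lucene code: the {{score}} function is a simplified BM25-like formula, the norm is treated as the raw field length, and the postings data and block size of 4 are made up for illustration. It counts the competitive (freq, norm) pairs per block before and after sorting by norm.

```python
def score(freq, norm, k1=1.2, b=0.75, avg_len=10.0):
    """BM25-like tf component (assumption): grows with freq, shrinks as norm grows."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * norm / avg_len))

def competitive_impacts(block):
    """Keep the (freq, norm) pairs that no other pair matches or beats on both freq and norm."""
    pairs = set(block)
    return sorted(
        (f, n) for (f, n) in pairs
        if not any(f2 >= f and n2 <= n and (f2, n2) != (f, n) for (f2, n2) in pairs)
    )

def block_max_score(block):
    """Best possible score in a block, computed from its impacts only."""
    return max(score(f, n) for (f, n) in competitive_impacts(block))

def split(postings, size):
    """Cut a postings list into consecutive blocks of at most `size` entries."""
    return [postings[i:i + size] for i in range(0, len(postings), size)]

# Hypothetical postings for one term: (term frequency, norm), in doc ID order.
postings = [(1, 2), (4, 8), (2, 5), (6, 12), (3, 8), (1, 5), (2, 2), (5, 12)]
# Sorting the index by norm reorders doc IDs so that low norms come first.
by_norm = sorted(postings, key=lambda p: p[1])

unsorted_counts = [len(competitive_impacts(b)) for b in split(postings, 4)]
sorted_counts = [len(competitive_impacts(b)) for b in split(by_norm, 4)]
print(unsorted_counts, sorted_counts)  # [4, 3] [1, 2]
```

With fewer impacts per block, {{block_max_score}} has fewer pairs to evaluate. In the first sorted block it scores a single pair.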

I hacked together a quick prototype and ran luceneutil on wikibig; the results are encouraging:
{noformat}
                    Task    QPS baseline      StdDev   QPS patch      StdDev                Pct diff
   HighTermDayOfYearSort       37.64      (6.4%)       33.96      (4.7%)   -9.8% ( -19% -    1%)
              HighPhrase       26.45      (2.7%)       25.24      (2.8%)   -4.6% (  -9% -    0%)
               OrHighLow      341.59      (2.8%)      327.84      (2.6%)   -4.0% (  -9% -    1%)
                  Fuzzy2      153.15      (5.3%)      147.70      (5.1%)   -3.6% ( -13% -    7%)
                  IntNRQ      151.43      (1.4%)      147.04      (3.4%)   -2.9% (  -7% -    1%)
       HighTermMonthSort       79.28      (6.4%)       79.44      (7.6%)    0.2% ( -12% -   15%)
                 Respell      229.10      (2.2%)      230.62      (1.8%)    0.7% (  -3% -    4%)
                  Fuzzy1      285.25      (6.9%)      288.99      (6.8%)    1.3% ( -11% -   16%)
                 Prefix3       34.60     (10.3%)       35.14     (10.6%)    1.6% ( -17% -   25%)
                Wildcard       72.36      (5.8%)       73.86      (6.3%)    2.1% (  -9% -   15%)
                 MedTerm     1895.68      (4.2%)     1939.92      (4.2%)    2.3% (  -5% -   11%)
            HighSpanNear        5.25      (6.0%)        5.46      (6.0%)    3.9% (  -7% -   17%)
         LowSloppyPhrase        6.85      (6.5%)        7.13      (6.3%)    4.2% (  -8% -   18%)
               LowPhrase       46.08      (1.7%)       48.56      (1.8%)    5.4% (   1% -    9%)
             LowSpanNear       24.03      (3.7%)       25.68      (4.3%)    6.9% (  -1% -   15%)
             MedSpanNear        5.20     (13.2%)        5.63     (15.2%)    8.3% ( -17% -   42%)
         MedSloppyPhrase       11.01      (4.5%)       11.95      (4.7%)    8.6% (   0% -   18%)
               MedPhrase       23.39      (2.6%)       25.64      (2.2%)    9.6% (   4% -   14%)
        HighSloppyPhrase        3.84      (5.9%)        4.26      (5.8%)   11.0% (   0% -   24%)
              AndHighLow      401.13      (3.4%)      458.11      (3.0%)   14.2% (   7% -   21%)
                 LowTerm     2294.98      (4.0%)     2863.59      (7.0%)   24.8% (  13% -   37%)
              AndHighMed       53.62      (3.8%)       71.40      (1.8%)   33.2% (  26% -   40%)
                HighTerm     1286.59      (3.9%)     1917.61      (5.7%)   49.0% (  38% -   60%)
             AndHighHigh       41.24      (3.5%)       69.17      (4.2%)   67.7% (  58% -   78%)
               OrHighMed       49.92      (2.4%)       84.95      (4.0%)   70.2% (  62% -   78%)
              OrHighHigh       43.55      (2.3%)       90.06      (4.8%)  106.8% (  97% -  116%)
{noformat}
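
For readers checking the table, the Pct diff column is consistent with the relative change in mean QPS from baseline to patch. A small sketch that recomputes it for three rows (the {{pct_diff}} helper is only illustrative and is not part of luceneutil):

```python
def pct_diff(baseline_qps, patch_qps):
    """Relative change of the patch over the baseline, in percent."""
    return 100.0 * (patch_qps - baseline_qps) / baseline_qps

# (baseline QPS, patch QPS) copied from the table above.
rows = {
    "OrHighLow": (341.59, 327.84),
    "HighTerm": (1286.59, 1917.61),
    "OrHighHigh": (43.55, 90.06),
}
diffs = {task: round(pct_diff(b, p), 1) for task, (b, p) in rows.items()}
print(diffs)  # {'OrHighLow': -4.0, 'HighTerm': 49.0, 'OrHighHigh': 106.8}
```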

The {{doc}} file is 12% smaller.

> Allow index sorting by field length
> -----------------------------------
>
>                 Key: LUCENE-8069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8069
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by field length would mean that we are likely to collect the best matches first. Depending on the similarity implementation, this might even allow early termination of the collection of top documents on term queries.
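
A minimal sketch of how such early termination could work for a single term query, in Python rather than against Lucene's API. It assumes that the score never decreases with term frequency and never increases with the norm, that documents are visited in increasing norm order, and that the maximum term frequency for the term is known. The {{score}} formula, the function name and the sample postings are all hypothetical.

```python
import heapq

def score(freq, norm, k1=1.2, b=0.75, avg_len=10.0):
    """BM25-like tf component (assumption): grows with freq, shrinks as norm grows."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * norm / avg_len))

def top_k_with_early_termination(postings, k, max_freq):
    """postings: (doc_id, freq, norm) tuples, sorted by increasing norm."""
    heap = []    # min-heap of (score, doc_id) holding the current top k
    visited = 0  # number of documents actually scored
    for doc_id, freq, norm in postings:
        # No remaining document can score above max_freq at the current norm,
        # because every later document has a norm at least this large.
        if len(heap) == k and heap[0][0] >= score(max_freq, norm):
            break
        visited += 1
        s = score(freq, norm)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True), visited

# Hypothetical postings for one term, already in index-sort (norm) order.
postings = [(0, 2, 2), (1, 1, 2), (2, 2, 4), (3, 1, 8), (4, 2, 16), (5, 1, 16)]
top, visited = top_k_with_early_termination(postings, k=2, max_freq=2)
print([doc_id for _, doc_id in top], visited)  # [0, 2] 3
```

In this example the collector stops after scoring three of the six documents and still returns the same top 2 as an exhaustive scan. Whether a similar bound holds in practice depends on the similarity implementation, as noted above.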


