Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2019/07/02 12:57:00 UTC
[jira] [Reopened] (LUCENE-8069) Allow index sorting by field length
[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand reopened LUCENE-8069:
----------------------------------
This idea has come back to me several times since I opened this issue. Sorting by norm brings the following benefits:
- Better compression: documents with small doc IDs are likely to have small term frequencies, since the term frequency is usually less than or equal to the norm.
- Smaller impacts: since each block of postings has only one unique norm value on average, it also has only one impact on average. This helps at search time, since computing the score of this impact immediately gives us the best score of the block, as opposed to having to iterate over several impacts and take the highest score.
- For term queries, it ensures that among all documents that have X occurrences of the queried term, we visit the documents with the lowest norm first, and thus the ones that produce the highest scores.
- Boolean queries are interesting: they get the same benefit as term queries, but on the other hand the norm tends to correlate with the number of unique terms, so you might need to collect more matches before you find one that matches several query terms.
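To make the first two points concrete, here is a toy sketch (not Lucene code). It assumes that the norm is simply the field length, that a block stores every frequency with the bit width of its largest frequency, and that an impact is a (freq, norm) pair not dominated by another pair in the same block with a frequency at least as high and a norm at least as low. The function names, the block size of 4 and the sample postings are all made up for illustration.

```python
from typing import List, Tuple

# (norm, term frequency) of one document; list order stands for doc ID order.
Posting = Tuple[int, int]


def split_blocks(postings: List[Posting], block_size: int) -> List[List[Posting]]:
    """Cut a postings list into fixed-size blocks, as a block-based codec would."""
    return [postings[i:i + block_size] for i in range(0, len(postings), block_size)]


def competitive_impacts(block: List[Posting]) -> List[Tuple[int, int]]:
    """Return the (freq, norm) pairs of a block that no other pair dominates."""
    pairs = {(freq, norm) for norm, freq in block}
    return sorted(
        p for p in pairs
        if not any(q != p and q[0] >= p[0] and q[1] <= p[1] for q in pairs)
    )


def packed_freq_bits(block: List[Posting]) -> int:
    """Simplified bit packing: every frequency uses the width of the largest one."""
    return max(freq for _, freq in block).bit_length() * len(block)


def summarize(postings: List[Posting], block_size: int) -> Tuple[int, int]:
    """Total number of impacts and total frequency bits over all blocks."""
    blocks = split_blocks(postings, block_size)
    return (
        sum(len(competitive_impacts(b)) for b in blocks),
        sum(packed_freq_bits(b) for b in blocks),
    )


# Hypothetical postings for one term; every frequency is <= its norm.
index_order = [(8, 5), (1, 1), (8, 3), (2, 1), (1, 1), (8, 6), (2, 2), (1, 1)]
norm_order = sorted(index_order, key=lambda p: p[0])  # index sorted by norm

print(summarize(index_order, 4))  # (5, 24)
print(summarize(norm_order, 4))   # (3, 16)
```

On this tiny input, sorting by norm drops the impact count from 5 to 3 and the packed frequency size from 24 to 16 bits. The numbers only illustrate the mechanism and say nothing about real index sizes.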
I hacked together a quick prototype and ran luceneutil on wikibig; the results are encouraging:
{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
HighTermDayOfYearSort         37.64   (6.4%)      33.96   (4.7%)   -9.8% (-19% -   1%)
HighPhrase                    26.45   (2.7%)      25.24   (2.8%)   -4.6% ( -9% -   0%)
OrHighLow                    341.59   (2.8%)     327.84   (2.6%)   -4.0% ( -9% -   1%)
Fuzzy2                       153.15   (5.3%)     147.70   (5.1%)   -3.6% (-13% -   7%)
IntNRQ                       151.43   (1.4%)     147.04   (3.4%)   -2.9% ( -7% -   1%)
HighTermMonthSort             79.28   (6.4%)      79.44   (7.6%)    0.2% (-12% -  15%)
Respell                      229.10   (2.2%)     230.62   (1.8%)    0.7% ( -3% -   4%)
Fuzzy1                       285.25   (6.9%)     288.99   (6.8%)    1.3% (-11% -  16%)
Prefix3                       34.60  (10.3%)      35.14  (10.6%)    1.6% (-17% -  25%)
Wildcard                      72.36   (5.8%)      73.86   (6.3%)    2.1% ( -9% -  15%)
MedTerm                     1895.68   (4.2%)    1939.92   (4.2%)    2.3% ( -5% -  11%)
HighSpanNear                   5.25   (6.0%)       5.46   (6.0%)    3.9% ( -7% -  17%)
LowSloppyPhrase                6.85   (6.5%)       7.13   (6.3%)    4.2% ( -8% -  18%)
LowPhrase                     46.08   (1.7%)      48.56   (1.8%)    5.4% (  1% -   9%)
LowSpanNear                   24.03   (3.7%)      25.68   (4.3%)    6.9% ( -1% -  15%)
MedSpanNear                    5.20  (13.2%)       5.63  (15.2%)    8.3% (-17% -  42%)
MedSloppyPhrase               11.01   (4.5%)      11.95   (4.7%)    8.6% (  0% -  18%)
MedPhrase                     23.39   (2.6%)      25.64   (2.2%)    9.6% (  4% -  14%)
HighSloppyPhrase               3.84   (5.9%)       4.26   (5.8%)   11.0% (  0% -  24%)
AndHighLow                   401.13   (3.4%)     458.11   (3.0%)   14.2% (  7% -  21%)
LowTerm                     2294.98   (4.0%)    2863.59   (7.0%)   24.8% ( 13% -  37%)
AndHighMed                    53.62   (3.8%)      71.40   (1.8%)   33.2% ( 26% -  40%)
HighTerm                    1286.59   (3.9%)    1917.61   (5.7%)   49.0% ( 38% -  60%)
AndHighHigh                   41.24   (3.5%)      69.17   (4.2%)   67.7% ( 58% -  78%)
OrHighMed                     49.92   (2.4%)      84.95   (4.0%)   70.2% ( 62% -  78%)
OrHighHigh                    43.55   (2.3%)      90.06   (4.8%)  106.8% ( 97% - 116%)
{noformat}
The {{doc}} file is 12% smaller.
> Allow index sorting by field length
> -----------------------------------
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by field length would mean we would be likely to collect the best matches first. Depending on the similarity implementation, this might even allow early termination of the collection of top documents on term queries.
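The early-termination idea from the description can be sketched as follows. This is a toy model, not Lucene code. It assumes postings are visited in doc ID order with non-decreasing norms, that the maximum term frequency of the postings list is known up front, and that the score increases with frequency and decreases with norm. The `score` formula is a hypothetical BM25-like saturation chosen only because it has that shape, and all names and data are illustrative.

```python
import heapq
from typing import List, Tuple


def score(freq: int, norm: int, k1: float = 1.2) -> float:
    """Toy score: grows with freq, shrinks with norm (hypothetical formula)."""
    return freq / (freq + k1 * norm)


def top_k_term_query(
    postings: List[Tuple[int, int, int]], k: int, max_freq: int
) -> Tuple[List[Tuple[float, int]], int]:
    """Collect the top k hits of a term query over (doc_id, freq, norm) postings.

    Norms must be non-decreasing in doc ID order. Returns the hits, best
    first, and the number of postings that were actually scored.
    """
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc_id)
    visited = 0
    for doc_id, freq, norm in postings:
        # Every remaining doc has norm >= this norm and freq <= max_freq, so
        # score(max_freq, norm) is an upper bound on all remaining scores.
        if len(heap) == k and score(max_freq, norm) <= heap[0][0]:
            break
        visited += 1
        s = score(freq, norm)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True), visited


# Hypothetical postings for one term in an index sorted by norm.
postings = [(0, 1, 1), (1, 2, 1), (2, 1, 2), (3, 3, 4), (4, 1, 8), (5, 3, 16), (6, 2, 16)]
hits, visited = top_k_term_query(postings, k=2, max_freq=3)
print([doc_id for _, doc_id in hits], visited)  # [1, 0] 3
```

Here collection stops after scoring 3 of the 7 postings, because no later document can beat the current second-best hit. Whether such a bound is available in practice depends on the similarity implementation, as the description notes.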
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)