Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2022/04/25 08:22:00 UTC

[jira] [Resolved] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible

     [ https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-8836.
----------------------------------
    Fix Version/s: 9.2
       Resolution: Fixed

I merged a change that only improves lookupOrd, and not seekCeil like the previous patch did, but I'm still inclined to mark this issue as resolved. Let's improve seekCeil in a follow-up if there's appetite for it?

> Optimize DocValues TermsDict to continue scanning from the last position when possible
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8836
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Priority: Major
>              Labels: docValues, optimization
>             Fix For: 9.2
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal.
> Currently it lacks an optimization that FSTEnum has: the ability to continue a sequential scan from the position of the last lookup in the IndexInput. For sparse lookups (when searching for only a few terms or ordinals) this is not an issue. But for multiple lookups in a row, this optimization could avoid re-scanning all the terms from the block start (a re-scan is otherwise needed because the terms are delta-encoded).
> This patch proposes the optimization.
> To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization:
> TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term reads.
> TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
> TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 82% term reads.
> In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In other cases, when looking up only a few sparse terms, the optimization brings no improvement, but it does not penalize either. It seems worthwhile to always enable it.
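
To make the idea in the description concrete, here is a minimal toy sketch of resuming a scan inside a delta-encoded (prefix-compressed) block. It is not Lucene's TermsDict code: the class name, the block size of 4, the use of String instead of BytesRef, and the in-memory arrays standing in for an IndexInput are all assumptions made for illustration. The "seeks" and "termReads" counters only mimic the kind of counting described above.

```java
import java.util.List;

/** Toy model only: hypothetical names, not Lucene's actual TermsDict code. */
final class ResumableTermsDictSketch {
    private static final int BLOCK_SIZE = 4; // terms per block (assumed value)

    // Entry i stores how many leading chars it shares with term i-1, plus the rest.
    // The first term of each block is stored in full (shared prefix length 0).
    private final int[] prefixLengths;
    private final String[] suffixes;
    private final boolean resume; // true = continue from the last position when possible

    private int currentOrd = -1;     // ordinal of the last decoded term, -1 if none
    private String currentTerm = ""; // last decoded term

    int seeks = 0;     // number of jumps back to a block start
    int termReads = 0; // number of entries decoded

    ResumableTermsDictSketch(List<String> sortedTerms, boolean resume) {
        this.resume = resume;
        int n = sortedTerms.size();
        prefixLengths = new int[n];
        suffixes = new String[n];
        for (int i = 0; i < n; i++) {
            String term = sortedTerms.get(i);
            int prefix = 0;
            if (i % BLOCK_SIZE != 0) {
                String prev = sortedTerms.get(i - 1);
                int max = Math.min(prev.length(), term.length());
                while (prefix < max && prev.charAt(prefix) == term.charAt(prefix)) {
                    prefix++;
                }
            }
            prefixLengths[i] = prefix;
            suffixes[i] = term.substring(prefix);
        }
    }

    /** Returns the term for the given ordinal, decoding as few entries as possible. */
    String lookupOrd(int ord) {
        if (ord < 0 || ord >= suffixes.length) {
            throw new IndexOutOfBoundsException("ord=" + ord);
        }
        int blockStart = ord - (ord % BLOCK_SIZE);
        // We can keep scanning forward only if the last decoded term is in the
        // same block and not past the target.
        boolean canResume = resume && currentOrd >= blockStart && currentOrd <= ord;
        if (!canResume) {
            currentOrd = blockStart - 1; // "seek" to the block start
            currentTerm = "";
            seeks++;
        }
        while (currentOrd < ord) {
            readNext();
        }
        return currentTerm;
    }

    /** Decodes the next entry from the previous term plus the stored suffix. */
    private void readNext() {
        currentOrd++;
        currentTerm = currentTerm.substring(0, prefixLengths[currentOrd]) + suffixes[currentOrd];
        termReads++;
    }

    public static void main(String[] args) {
        List<String> terms = List.of(
            "apple", "apply", "apricot", "banana", "band", "bandit", "can", "candle");
        for (boolean resume : new boolean[] {false, true}) {
            ResumableTermsDictSketch dict = new ResumableTermsDictSketch(terms, resume);
            for (int ord = 0; ord < terms.size(); ord++) {
                dict.lookupOrd(ord);
            }
            // prints: resume=false seeks=8 termReads=20
            //         resume=true seeks=2 termReads=8
            System.out.println(
                "resume=" + resume + " seeks=" + dict.seeks + " termReads=" + dict.termReads);
        }
    }
}
```

In this toy run, looking up ordinals 0 through 7 in order costs 8 seeks and 20 term reads without resuming, versus 2 seeks and 8 term reads with it. A lookup that goes backwards, or that lands in a different block, falls back to the block start exactly as before, which is why sparse lookups see no benefit and essentially no penalty.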


