Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2022/05/12 21:36:00 UTC

[jira] [Created] (LUCENE-10569) Think again about the floor segment size?

Adrien Grand created LUCENE-10569:
-------------------------------------

             Summary: Think again about the floor segment size?
                 Key: LUCENE-10569
                 URL: https://issues.apache.org/jira/browse/LUCENE-10569
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


TieredMergePolicy has a floor segment size that it uses to prevent indexes from having a long tail of small segments, which would be very inefficient at search time. It is 2MB by default.

While this floor segment size is good for searches, it also has the side effect of making merges run in quadratic time when segments are below this size. This caught me by surprise several times when working on datasets that have few fields or that are extremely space-efficient: even segments that are not so small from a doc count perspective could be considered too small and trigger quadratic merging because of this floor segment size.
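
To make the quadratic behaviour concrete, here is a rough sketch in plain Java. It is a deliberately simplified model, not TieredMergePolicy's actual selection logic: the class name, the method names and the "merge factor" parameter are all assumptions made for illustration. It compares the bytes rewritten when every segment is treated as the same size (which is roughly what happens below the floor, so the already-merged segment keeps being merged again) with the bytes rewritten when only equally sized segments are merged together.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only; this is NOT how TieredMergePolicy picks merges.
class FloorMergeSketch {

    // Below the floor, all segments look equally sized, so as soon as
    // mergeFactor segments exist they are all merged into one, including
    // the segment produced by the previous merge.
    static long flooredMergeCost(int flushes, long flushBytes, int mergeFactor) {
        checkMergeFactor(mergeFactor);
        List<Long> segments = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < flushes; i++) {
            segments.add(flushBytes);
            if (segments.size() == mergeFactor) {
                long merged = 0;
                for (long size : segments) {
                    merged += size;
                }
                written += merged; // the whole accumulated data is rewritten
                segments.clear();
                segments.add(merged);
            }
        }
        return written;
    }

    // Tiered model: only merge when the newest mergeFactor segments all
    // have the same size, cascading upwards through the tiers.
    static long tieredMergeCost(int flushes, long flushBytes, int mergeFactor) {
        checkMergeFactor(mergeFactor);
        List<Long> segments = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < flushes; i++) {
            segments.add(flushBytes);
            while (tailIsUniform(segments, mergeFactor)) {
                long merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += segments.remove(segments.size() - 1);
                }
                written += merged;
                segments.add(merged);
            }
        }
        return written;
    }

    // True when the newest mergeFactor segments all have the same size.
    static boolean tailIsUniform(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) {
            return false;
        }
        long last = segments.get(n - 1);
        for (int j = n - mergeFactor; j < n; j++) {
            long size = segments.get(j);
            if (size != last) {
                return false;
            }
        }
        return true;
    }

    static void checkMergeFactor(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
    }

    public static void main(String[] args) {
        for (int flushes : new int[] {100, 1000}) {
            System.out.println(flushes + " flushes: floored="
                + flooredMergeCost(flushes, 1, 10)
                + " tiered=" + tieredMergeCost(flushes, 1, 10));
        }
    }
}
```

In this model, with one-unit flushes and a merge factor of 10, going from 100 to 1000 flushes raises the floored cost from 605 to 56055 units (roughly 100x for 10x the data), while the tiered cost goes from 200 to 3000 units. Real indexes leave the quadratic regime once segments grow past the floor, which is why the problem shows up mostly with small or highly compressible documents.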

We came up with this 2MB floor segment size many years ago, when Lucene was less space-efficient. I think we should consider lowering it at a minimum, and maybe moving to a threshold on the document count rather than on the byte size of the segment, to work better with datasets of small or highly compressible documents.
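
As a small illustration of why the two kinds of threshold can disagree, the sketch below checks one hypothetical segment against both. Only the 2MB value comes from this issue; the document-count floor of 10,000, the 3 bytes per document and all names are assumptions picked for the example, not proposed defaults.

```java
// Illustrative only: the doc-count floor below is a made-up value.
class FloorCheckSketch {
    static final long FLOOR_BYTES = 2L * 1024 * 1024;  // 2MB default floor
    static final int HYPOTHETICAL_FLOOR_DOCS = 10_000; // assumption, not a Lucene default

    static boolean belowByteFloor(long sizeBytes) {
        return sizeBytes < FLOOR_BYTES;
    }

    static boolean belowDocFloor(int docCount) {
        return docCount < HYPOTHETICAL_FLOOR_DOCS;
    }

    public static void main(String[] args) {
        // A segment of 500,000 very compact documents (assumed 3 bytes each).
        int docCount = 500_000;
        long sizeBytes = docCount * 3L;
        System.out.println("below byte floor: " + belowByteFloor(sizeBytes));
        System.out.println("below doc floor:  " + belowDocFloor(docCount));
    }
}
```

Under these assumptions the segment is still "too small" by byte size even though it holds half a million documents, whereas a document-count floor would leave it alone.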

Separately, we should enable merge-on-refresh by default (LUCENE-10078) to make sure that searches actually take advantage of this quadratic merging of small segments, which exists only to make searches more efficient.


