Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2022/05/04 07:10:00 UTC

[jira] [Created] (LUCENE-10556) Relax the maximum dirtiness for stored fields and term vectors?

Adrien Grand created LUCENE-10556:
-------------------------------------

             Summary: Relax the maximum dirtiness for stored fields and term vectors?
                 Key: LUCENE-10556
                 URL: https://issues.apache.org/jira/browse/LUCENE-10556
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Stored fields and term vectors compress data and have merge-time optimizations that copy compressed data directly instead of decompressing and recompressing it on every merge. However, incomplete blocks sometimes get carried over (typically the last block of a flushed segment), so these file formats keep track of how "dirty" their current blocks are in order to decide whether the stored fields / term vectors of a segment should be recompressed.

Currently the logic is to recompress if more than 1% of the blocks are incomplete, or if the total number of missing documents across incomplete blocks is more than the configured maximum number of documents per block.

I'd be interested in evaluating what the compression ratio would be if we relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut feeling is that the compression ratio would be only marginally worse while index-time CPU usage could improve significantly.
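
For illustration, the rule described above and the proposed relaxation could be sketched roughly as follows. This is a minimal sketch based only on the wording of this issue, not Lucene's actual code: the class name, method name, and parameters are all hypothetical, and it assumes that "5% dirtiness" would apply to the percentage of incomplete blocks.

```java
/** Hypothetical sketch of the dirtiness check described in this issue; not Lucene's API. */
final class DirtinessSketch {

  /**
   * Returns true if the segment should be recompressed at merge time.
   *
   * @param numChunks total number of compressed blocks in the segment
   * @param numDirtyChunks number of incomplete blocks
   * @param numDirtyDocs total number of documents missing across incomplete blocks
   * @param maxDocsPerChunk configured maximum number of documents per block
   * @param maxDirtyPercent tolerated percentage of incomplete blocks (assumed: 1 today, 5 proposed)
   */
  static boolean tooDirty(
      long numChunks,
      long numDirtyChunks,
      long numDirtyDocs,
      int maxDocsPerChunk,
      int maxDirtyPercent) {
    // Condition 1: more than maxDirtyPercent of the blocks are incomplete.
    boolean tooManyDirtyChunks = numDirtyChunks * 100 > numChunks * maxDirtyPercent;
    // Condition 2: the missing documents would add up to more than one full block.
    boolean tooManyMissingDocs = numDirtyDocs > maxDocsPerChunk;
    return tooManyDirtyChunks || tooManyMissingDocs;
  }

  public static void main(String[] args) {
    // Made-up segment: 1000 blocks, 30 incomplete, 100 docs missing, 128 docs per block.
    System.out.println(tooDirty(1000, 30, 100, 128, 1)); // 3% of blocks exceeds 1%
    System.out.println(tooDirty(1000, 30, 100, 128, 5)); // 3% is within 5%, 100 <= 128
  }
}
```

With these made-up numbers, the segment would be recompressed under a 1% threshold but bulk-copied under a 5% threshold, which is the kind of case where the index-time CPU savings would come from.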


