
Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2020/09/15 13:55:00 UTC

[jira] [Created] (LUCENE-9525) Better handle small documents with the new Lucene87StoredFieldsFormat

Adrien Grand created LUCENE-9525:
------------------------------------

             Summary: Better handle small documents with the new Lucene87StoredFieldsFormat
                 Key: LUCENE-9525
                 URL: https://issues.apache.org/jira/browse/LUCENE-9525
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Stored fields configure a maximum number of documents per block, whose goal is to make sure that you never decompress more than X documents to get access to a single one. However, this has interesting effects with the new format when documents are small.

For instance, we use 4kB of dictionary and blocks of 60kB, with at most 512 documents per block. So if your documents are very small, say 10 bytes each, the block will be 512*10=5120 bytes overall; we'll first compress 4096 bytes independently, and then the remaining 5120-4096=1024 bytes with 4096 bytes of dictionary. In this case training the dictionary takes more time than actually compressing the data, and it's not even clear that it's worth it, since only 1024 bytes out of the 5120 bytes of the block get compressed with a preset dictionary.
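
The arithmetic above can be reproduced with a small sketch. This is not Lucene code: the function name block_layout and the flush rule (a block is closed as soon as either the 60kB size limit or the 512-document limit is reached) are assumptions made for illustration, and documents are assumed to all have the same size.

```python
DICT_SIZE = 4 * 1024        # 4kB of dictionary, as quoted in the issue
MAX_BLOCK_SIZE = 60 * 1024  # 60kB blocks, as quoted in the issue
MAX_DOCS_PER_BLOCK = 512    # at most 512 documents per block


def block_layout(doc_size, dict_size=DICT_SIZE,
                 max_block_size=MAX_BLOCK_SIZE,
                 max_docs=MAX_DOCS_PER_BLOCK):
    """Return (block_bytes, dict_bytes, remainder_bytes) for uniform documents.

    Hypothetical helper: assumes the block is flushed when either the
    size limit or the document-count limit is hit, whichever comes first.
    """
    # Documents needed to reach the size limit (ceiling division).
    docs_to_fill = -(-max_block_size // doc_size)
    docs = min(max_docs, docs_to_fill)
    block_bytes = docs * doc_size
    # The first dict_size bytes are compressed on their own and serve as
    # the dictionary; only the rest benefits from the preset dictionary.
    dict_bytes = min(dict_size, block_bytes)
    return block_bytes, dict_bytes, block_bytes - dict_bytes


block, dictionary, remainder = block_layout(10)
print(block, dictionary, remainder)  # 5120 4096 1024
```

With 10-byte documents the document limit is reached long before the size limit, so only 1024 of the 5120 bytes (20%) are compressed against the dictionary.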

I'm considering adapting the dictionary size and the block size to the total block size in order to better handle such cases.
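
As a rough illustration of that idea, the sketch below scales the dictionary with the amount of data actually buffered. It covers only the dictionary size, and it is not what the issue proposes in detail: the helper name adapted_dict_size and the 1/8 ratio are made-up assumptions, chosen only to show the effect on the 5120-byte block from the example.

```python
MAX_DICT_SIZE = 4 * 1024  # 4kB upper bound, matching the current dictionary size


def adapted_dict_size(block_bytes, max_dict_size=MAX_DICT_SIZE, ratio=8):
    """Pick a dictionary size proportional to the block size.

    Hypothetical rule: use 1/ratio of the block as dictionary, capped at
    max_dict_size. The ratio of 8 is an arbitrary illustrative value.
    """
    return min(max_dict_size, block_bytes // ratio)


block_bytes = 512 * 10                  # 5120-byte block of 10-byte documents
dict_bytes = adapted_dict_size(block_bytes)
remainder = block_bytes - dict_bytes    # bytes compressed with the dictionary
print(dict_bytes, remainder)            # 640 4480
```

Under this assumed rule, a full 60kB block still gets the 4kB dictionary, while the small block spends far less of its data on the dictionary.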


