Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/09/16 11:10:00 UTC

[jira] [Commented] (LUCENE-9525) Better handle small documents with the new Lucene87StoredFieldsFormat

    [ https://issues.apache.org/jira/browse/LUCENE-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196880#comment-17196880 ] 

ASF subversion and git services commented on LUCENE-9525:
---------------------------------------------------------

Commit ad71bee0161cd52dba73f866c897e88fde2639a4 in lucene-solr's branch refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ad71bee ]

LUCENE-9525: Better handle small documents with Lucene87StoredFieldsFormat. (#1876)

Instead of configuring a fixed dictionary size and a fixed block size, the format
now targets 10 sub blocks per top-level block, and adapts the size of
the dictionary and of the sub blocks to this overall block size.
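Here is a minimal sketch of that sizing heuristic. The class and constant names are assumptions for illustration (in particular DICT_SIZE_FACTOR, the assumed ratio between a sub block and the dictionary), not necessarily what the Lucene sources use:

    // Hedged sketch of the sizing heuristic from the commit message above.
    // Names and DICT_SIZE_FACTOR are illustrative assumptions, not taken
    // verbatim from Lucene's stored fields code.
    class BlockSizing {
        static final int NUM_SUB_BLOCKS = 10;   // aim for ~10 sub blocks per block
        static final int DICT_SIZE_FACTOR = 6;  // assumed: dictionary = 1/6 of a sub block

        // Dictionary length derived from the overall block length.
        static int dictLength(int totalLength) {
            return totalLength / (NUM_SUB_BLOCKS * DICT_SIZE_FACTOR);
        }

        // Sub-block length: split the remainder evenly across the
        // sub blocks, rounding up.
        static int subBlockLength(int totalLength) {
            int dict = dictLength(totalLength);
            return (totalLength - dict + NUM_SUB_BLOCKS - 1) / NUM_SUB_BLOCKS;
        }
    }

Under these assumed constants, the 5120-byte block from the issue description below would get a dictionary of about 85 bytes and sub blocks of about 504 bytes, so dictionary training no longer dominates the cost of compressing the actual data.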

> Better handle small documents with the new Lucene87StoredFieldsFormat
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-9525
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9525
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Stored fields configure a maximum number of documents per block, whose goal is to make sure that you don't decompress more than X documents to get access to a single one. However, this limit has surprising effects with the new format.
> For instance, we use a 4kB dictionary and 60kB sub blocks for at most 512 documents per block. So if your documents are very small, say 10 bytes, the block will be 5120 bytes overall: we first compress 4096 bytes independently to train the dictionary, and then the remaining 5120-4096=1024 bytes with those 4096 bytes as a preset dictionary. In this case training the dictionary takes more time than actually compressing the data, and it's not even clear that it's worth it, since only 1024 of the block's 5120 bytes get compressed with a preset dictionary (the sketch below walks through this split).
> I'm considering adapting the dictionary size and the sub-block size to the total block size in order to better handle such cases.
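A short sketch of the fixed-size scheme the description complains about, using hypothetical constant names (DICT_LENGTH, SUB_BLOCK_LENGTH) with the values quoted in the issue; it reproduces the 4096/1024 split from the example:

    // Illustration of the fixed-size scheme from the quoted description.
    // Constant names are hypothetical; the values are the ones quoted in
    // the issue (4 kB dictionary, 60 kB sub blocks).
    class FixedSizing {
        static final int DICT_LENGTH = 4 * 1024;       // fixed dictionary size
        static final int SUB_BLOCK_LENGTH = 60 * 1024; // fixed sub-block size

        static void describe(int totalLength) {
            int dict = Math.min(DICT_LENGTH, totalLength);
            int remaining = totalLength - dict;
            int numSubBlocks = (remaining + SUB_BLOCK_LENGTH - 1) / SUB_BLOCK_LENGTH;
            System.out.println("dictionary: " + dict + " bytes, remaining "
                + remaining + " bytes in " + numSubBlocks + " sub block(s)");
        }

        public static void main(String[] args) {
            // 512 documents of 10 bytes each, as in the example above:
            describe(512 * 10);
            // prints: dictionary: 4096 bytes, remaining 1024 bytes in 1 sub block(s)
        }
    }

Most of the block's bytes go into training the dictionary, and only the small remainder benefits from it, which is exactly the imbalance the committed change addresses.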



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org