Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/05/12 08:34:00 UTC

[jira] [Commented] (LUCENE-10536) Doc values terms dicts should use the first term of each block as a dictionary

    [ https://issues.apache.org/jira/browse/LUCENE-10536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535967#comment-17535967 ] 

ASF subversion and git services commented on LUCENE-10536:
----------------------------------------------------------

Commit 8f89db8048033cdd63ee08ee9e99d64c6fa4c90d in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8f89db80480 ]

LUCENE-10536: Slightly better compression of doc values' terms dictionaries. (#838)

Doc values terms dictionaries keep the first term of each block uncompressed so
that they can somewhat efficiently perform binary searches across blocks.
Suffixes of the other 63 terms are compressed together using LZ4 to leverage
redundancy across suffixes. This change improves compression a bit by using the
first (uncompressed) term of each block as a dictionary when compressing
suffixes of the 63 other terms. This helps with compressing the first few
suffixes when there's not much context yet that can be leveraged to find
duplicates.
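
The effect described above can be illustrated with a small sketch. This is not
Lucene's code: it uses raw DEFLATE from Python's zlib module as a stand-in for
LZ4, because zlib supports preset dictionaries in the standard library. The
helper names (shared_prefix_len, deflate, inflate) and the sample terms are
made up for illustration.

```python
import zlib


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def deflate(data: bytes, dictionary: bytes = None) -> bytes:
    # Raw DEFLATE (wbits=-15) so that no header bytes skew the comparison.
    # NOTE: DEFLATE is a stand-in here; the commit itself uses LZ4.
    if dictionary is None:
        c = zlib.compressobj(9, zlib.DEFLATED, -15)
    else:
        c = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=dictionary)
    return c.compress(data) + c.flush()


def inflate(data: bytes, dictionary: bytes = None) -> bytes:
    if dictionary is None:
        d = zlib.decompressobj(-15)
    else:
        d = zlib.decompressobj(-15, zdict=dictionary)
    return d.decompress(data) + d.flush()


# A tiny sorted "block" (hypothetical data; real blocks hold 64 terms).
terms = [
    b"user:alice@example.com",
    b"user:bob@example.com",
    b"user:carol@example.com",
]
first_term = terms[0]

# Suffix of each later term after the prefix shared with its predecessor.
suffixes = b"".join(
    cur[shared_prefix_len(prev, cur):] for prev, cur in zip(terms, terms[1:])
)

without_dict = deflate(suffixes)
with_dict = deflate(suffixes, dictionary=first_term)

print(len(suffixes), len(without_dict), len(with_dict))
```

Without a dictionary, the first suffix ("bob@example.com") has nothing earlier
to refer back to. With the first term as a dictionary, its "@example.com" tail
can already be encoded as a back-reference. The reader must supply the same
first term when decompressing, which costs nothing since that term is stored
uncompressed anyway.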

> Doc values terms dicts should use the first term of each block as a dictionary
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10536
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Doc values terms dictionaries split data into blocks of 64 terms, where the first term is written uncompressed (which is useful for binary searches), and the 63 other terms are encoded by taking the difference with the previous term and compressing all suffixes together with LZ4.
> With this format, the suffix of the second term is unlikely to benefit from any compression, since there is no earlier data besides the suffix itself in which to search for duplicate bytes. A minor improvement would be to use the first term as a dictionary for the suffixes of terms 2..64.
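
As a rough sketch of why keeping the first term of each block uncompressed
helps binary searches: a lookup can compare the target against the first terms
alone and only then decode the single block that may contain it. The function
names and sample terms below are hypothetical, and this is not how Lucene
stores its block index on disk.

```python
from bisect import bisect_right

BLOCK_SIZE = 64  # terms per block, as described in the issue


def build_block_index(sorted_terms):
    """First term of each block; these are the terms kept uncompressed."""
    return [sorted_terms[i] for i in range(0, len(sorted_terms), BLOCK_SIZE)]


def find_block(first_terms, target):
    """Index of the only block that can contain target, or -1 if none can."""
    return bisect_right(first_terms, target) - 1


# Hypothetical sorted dictionary of 200 terms, giving 4 blocks.
terms = sorted(b"term%05d" % i for i in range(200))
first_terms = build_block_index(terms)

print(find_block(first_terms, b"term00063"))
print(find_block(first_terms, b"term00064"))
```

Once the block is known, its remaining terms still have to be decoded in
order, since each one is stored as a difference from the previous term.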


