Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/12/09 16:57:58 UTC

[jira] [Updated] (LUCENE-7589) Prevent outliers from raising the number of bits of everyone with numeric doc values

     [ https://issues.apache.org/jira/browse/LUCENE-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7589:
---------------------------------
    Attachment: LUCENE-7589.patch

Here is a patch. The doc values consumer computes space usage for two cases: all values using the same number of bits per value, and values split into blocks of 16384 values. If using blocks saves 10% or more of disk usage, it encodes each block with its own required number of bits per value.
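To make the comparison above concrete, here is a minimal sketch of the two estimates and the 10% threshold. It is not the code from the patch: the class name, the method names and the per-block header cost ({{HEADER_BITS}}) are assumptions made for illustration only.

```java
// Illustrative sketch only, not the patch itself.
final class BlockBitsEstimator {
    static final int BLOCK_SIZE = 16384;
    // Assumption: each block stores a small header (bits per value plus a minimum).
    static final long HEADER_BITS = 8 + 64;

    // Bits needed to store any value in [min, max] as an unsigned delta from min.
    static int bitsRequired(long min, long max) {
        return 64 - Long.numberOfLeadingZeros(max - min);
    }

    // Bits needed for values[from, to) when they all share one width.
    private static long rangeBits(long[] values, int from, int to) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = from; i < to; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return (long) bitsRequired(min, max) * (to - from);
    }

    // Case 1: one number of bits per value for the whole segment.
    static long singleWidthBits(long[] values) {
        return values.length == 0 ? 0 : rangeBits(values, 0, values.length);
    }

    // Case 2: each block of BLOCK_SIZE values gets its own number of bits per value.
    static long blockedBits(long[] values) {
        long total = 0;
        for (int from = 0; from < values.length; from += BLOCK_SIZE) {
            int to = Math.min(values.length, from + BLOCK_SIZE);
            total += HEADER_BITS + rangeBits(values, from, to);
        }
        return total;
    }

    // Use blocks only if they save 10% or more compared to a single width.
    static boolean useBlocks(long[] values) {
        return blockedBits(values) * 10 <= singleWidthBits(values) * 9;
    }
}
```

With two blocks of small values and a single large outlier in the first block, only the first block pays for the outlier, so {{useBlocks}} returns true. Without the outlier, the block headers are pure overhead and the single-width encoding is kept.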

I kept the block size rather high, since this impl can only jump forward {{blockSize}} documents at a time, so a high value like 16384 hopefully keeps performance good. In the future we might want to look into leveraging the sequential access pattern even more (to do run-length encoding, for instance) and maybe have e.g. a skip list to handle the big jumps, like postings do. I think this patch is a good first (baby) step in that direction.
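As a rough illustration of why a large block size helps here, the sketch below models a forward-only reader over variable-length blocks. Since blocks can have different sizes, reaching a block means stepping over every block before it, so a seek costs one hop per block. The class, its fields and the in-memory {{blockByteLengths}} array are hypothetical stand-ins, not the data structures from the patch.

```java
// Illustrative sketch only: a forward-only cursor over variable-length blocks.
final class ForwardBlockCursor {
    private final int blockShift;           // blockSize == 1 << blockShift
    private final long[] blockByteLengths;  // hypothetical: encoded length of each block
    private int currentBlock = 0;
    private long currentOffset = 0;
    private int hops = 0;

    ForwardBlockCursor(int blockShift, long[] blockByteLengths) {
        this.blockShift = blockShift;
        this.blockByteLengths = blockByteLengths;
    }

    // Returns the start offset of the block containing docID.
    // Can only move forward, one block at a time.
    long seek(int docID) {
        int target = docID >>> blockShift;
        if (target < currentBlock || target >= blockByteLengths.length) {
            throw new IllegalArgumentException("cannot seek to block " + target);
        }
        while (currentBlock < target) {
            currentOffset += blockByteLengths[currentBlock];
            currentBlock++;
            hops++;
        }
        return currentOffset;
    }

    // Number of blocks stepped over so far.
    int hops() {
        return hops;
    }
}
```

Halving the block size doubles the number of hops needed to cover the same jump in doc IDs, which is the cost a skip list would be meant to avoid.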

> Prevent outliers from raising the number of bits of everyone with numeric doc values
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7589
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7589.patch
>
>
> Today we encode entire segments with a single number of bits per value. It was done this way because it was faster, but it also means a single outlier can significantly increase the space requirements. I think we should have protection against that.


