Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2014/12/02 01:04:13 UTC

[jira] [Updated] (LUCENE-5914) More options for stored fields compression

     [ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5914:
---------------------------------
    Attachment: LUCENE-5914.patch

Here is a new patch that iterates on Robert's:
 - improved compression for numerics:
    - floats and doubles representing small integers take 1 byte
    - other positive floats and doubles take 4 / 8 bytes
    - other floats and doubles (negative) take 5 / 9 bytes
    - doubles that are actually cast floats take 5 bytes
    - longs are compressed if they represent a timestamp (2 bits encode whether the number is a multiple of a second, an hour, or a day, or is uncompressed)
 - clean up of the checkFooter calls in the reader
 - slightly better encoding of the offsets with the BEST_SPEED option by using monotonic encoding: this allows the reader to just slurp a sequence of bytes and then decode a single value, instead of having to decode lengths and sum them up to obtain offsets (the BEST_COMPRESSION option still decodes and sums lengths, however)
 - fixed some javadoc errors
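
To make the timestamp case above concrete, here is a minimal sketch of one way such an encoding could work. It is not the code from the patch: the class name TimestampSketch, the header layout (2 tag bits, 1 continuation bit, 5 value bits), and the zig-zag plus variable-length tail are all assumptions chosen for illustration. Only the idea of spending 2 bits to record "multiple of a second / hour / day / uncompressed" comes from the description above.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; names and byte layout are assumptions.
final class TimestampSketch {
    static final long SECOND = 1000L;          // timestamps are assumed to be in milliseconds
    static final long HOUR = 3600L * SECOND;
    static final long DAY = 24L * HOUR;
    // Divisors indexed by the 2-bit tag; tag 0 means "stored uncompressed".
    static final long[] UNITS = {1L, SECOND, HOUR, DAY};

    static byte[] encode(long millis) {
        // Pick the coarsest unit that divides the value exactly.
        int tag = 0;
        for (int t = 3; t >= 1; t--) {
            if (millis % UNITS[t] == 0) {
                tag = t;
                break;
            }
        }
        long value = millis / UNITS[tag];
        // Zig-zag so that small negative values also become small unsigned ones.
        long zigzag = (value << 1) ^ (value >> 63);
        long rest = zigzag >>> 5;

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Header byte: 2 tag bits, 1 "more bytes follow" bit, 5 low bits of the value.
        int header = (tag << 6) | (int) (zigzag & 0x1F);
        if (rest != 0) {
            header |= 0x20;
        }
        out.write(header);
        if (rest != 0) {
            // Remaining bits as a variable-length integer, 7 bits per byte.
            while ((rest & ~0x7FL) != 0) {
                out.write((int) ((rest & 0x7F) | 0x80));
                rest >>>= 7;
            }
            out.write((int) rest);
        }
        return out.toByteArray();
    }

    static long decode(byte[] bytes) {
        int header = bytes[0] & 0xFF;
        long zigzag = header & 0x1F;
        if ((header & 0x20) != 0) {
            long rest = 0;
            int shift = 0;
            int i = 1;
            int b;
            do {
                b = bytes[i++] & 0xFF;
                rest |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            zigzag |= rest << 5;
        }
        long value = (zigzag >>> 1) ^ -(zigzag & 1);   // undo zig-zag
        return value * UNITS[header >>> 6];            // scale back by the tagged unit
    }

    public static void main(String[] args) {
        long[] samples = {10L * DAY, 1417478653000L, 1417478653001L};
        for (long s : samples) {
            byte[] enc = encode(s);
            System.out.println(s + " -> " + enc.length + " byte(s), decoded " + decode(enc));
        }
    }
}
```

In this sketch a value that falls on a day boundary fits in far fewer bytes than the same instant plus one millisecond, which is the effect the timestamp bullet is after.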
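
The difference between storing lengths and storing monotonically encoded offsets can be illustrated in a similar way. The following is only a sketch under simplifying assumptions: OffsetSketch and its methods are hypothetical names, plain int arrays stand in for the packed on-disk representation, and the "average length plus per-document deviation" scheme is just one common way to encode a monotonic sequence, not necessarily what the patch does.

```java
// Hypothetical illustration only; a real format would bit-pack these arrays.
final class OffsetSketch {
    // Lengths layout: finding where a document starts requires a prefix sum.
    static int offsetFromLengths(int[] lengths, int doc) {
        int sum = 0;
        for (int i = 0; i < doc; i++) {
            sum += lengths[i];
        }
        return sum;
    }

    // Offsets have one more entry than lengths: the start of each document plus the end.
    static int[] toOffsets(int[] lengths) {
        int[] offsets = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    static int averageLength(int[] offsets) {
        int n = offsets.length - 1;
        return n == 0 ? 0 : offsets[n] / n;
    }

    // Monotonic layout: store only the deviation from the straight line avg * i.
    // The deviations stay small when documents have similar sizes.
    static int[] toDeltas(int[] offsets, int avg) {
        int[] deltas = new int[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            deltas[i] = offsets[i] - avg * i;
        }
        return deltas;
    }

    // A single value is decoded; no summing over earlier documents.
    static int offsetFromDeltas(int[] deltas, int avg, int doc) {
        return avg * doc + deltas[doc];
    }

    public static void main(String[] args) {
        int[] lengths = {10, 12, 9, 11};
        int[] offsets = toOffsets(lengths);
        int avg = averageLength(offsets);
        int[] deltas = toDeltas(offsets, avg);
        System.out.println(offsetFromLengths(lengths, 2) + " == " + offsetFromDeltas(deltas, avg, 2));
    }
}
```

The lookup from deltas touches one entry regardless of the document's position, whereas the lengths layout has to walk every preceding entry.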

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have heard from about as many users complaining that compression was too aggressive as users complaining that it was too light.
> I think this is because we have users who are doing very different things with Lucene. For example, if you have a small index that fits in the filesystem cache (or comes close to fitting), then you might never pay for actual disk seeks, and in such a case the fact that the current stored fields format needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like log analytics, and in that case you have huge amounts of data for which you don't care much about stored fields performance. However, it is very frustrating to notice that the data you store takes several times less space when you gzip it than it does in your index, although Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have some kind of option that would allow users to trade speed for compression in the default codec.


