Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2014/08/29 19:15:52 UTC

[jira] [Updated] (LUCENE-5914) More options for stored fields compression

     [ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-5914:
---------------------------------

    Attachment: LUCENE-5914.patch

What I have been thinking about is providing 2 options for compression (we could have more than 2, but that would make backward compatibility more complicated):
 - one option that focuses on search speed,
 - one option that focuses on compression.

Here is how the current patch tries to address these requirements:

1. For high compression, documents are grouped into blocks of 16KB (pretty much like today) but, instead of being compressed with LZ4, they are compressed with DEFLATE at a low compression level (3, which is the highest level that doesn't use lazy match evaluation; I think it is a good trade-off for our stored fields).

If you want to decompress a document, you need to decompress the whole block.
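
For illustration only, here is a minimal sketch of that path using plain java.util.zip (the patch itself goes through Lucene's own abstractions, so the class and method names below are made up):

    import java.io.ByteArrayOutputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Sketch of the high-compression option: all documents of a ~16KB block are
    // concatenated and compressed together with DEFLATE at level 3.
    class DeflateBlockSketch {

      static byte[] compressBlock(byte[] block) {
        Deflater deflater = new Deflater(3); // highest level without lazy match evaluation
        deflater.setInput(block);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (deflater.finished() == false) {
          int n = deflater.deflate(buf);
          out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
      }

      // Reading a single document requires inflating the whole block first,
      // then slicing the document out at its offset.
      static byte[] readDocument(byte[] compressedBlock, int blockLength, int docOffset, int docLength)
          throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressedBlock);
        byte[] block = new byte[blockLength];
        int decompressed = 0;
        while (decompressed < blockLength && inflater.finished() == false) {
          decompressed += inflater.inflate(block, decompressed, blockLength - decompressed);
        }
        inflater.end();
        byte[] doc = new byte[docLength];
        System.arraycopy(block, docOffset, doc, 0, docLength);
        return doc;
      }
    }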

2. For better search speed, documents are compressed individually with LZ4. In order to keep the compression ratio good enough, documents are still grouped into blocks, and the data that results from the compression of the previous documents in the block is used as a dictionary in order to compress the current document.

When you want to decompress, you can decompress a single document at a time; all you need is a buffer that stores the compressed representation of the previous documents in the block so that the decompression routine can make references to it.
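
Again just to make the idea concrete (the actual patch uses Lucene's internal LZ4 implementation; the Lz4WithDict interface below is made up for the example):

    import java.io.ByteArrayOutputStream;

    // Made-up interface standing for an LZ4 implementation that accepts a preset
    // dictionary; only here to illustrate the scheme.
    interface Lz4WithDict {
      byte[] compress(byte[] dict, byte[] doc);
      byte[] decompress(byte[] dict, byte[] compressedDoc, int originalLength);
    }

    class Lz4DictBlockSketch {

      // Each document is compressed individually, but the compressed bytes of the
      // previous documents of the block serve as the dictionary.
      static byte[][] compressBlock(Lz4WithDict lz4, byte[][] docs) {
        ByteArrayOutputStream dict = new ByteArrayOutputStream();
        byte[][] compressed = new byte[docs.length][];
        for (int i = 0; i < docs.length; i++) {
          compressed[i] = lz4.compress(dict.toByteArray(), docs[i]);
          dict.write(compressed[i], 0, compressed[i].length); // dictionary grows as the block fills up
        }
        return compressed;
      }

      // To read document i, the compressed bytes of documents 0..i-1 are enough:
      // they only need to be buffered, not decompressed.
      static byte[] readDocument(Lz4WithDict lz4, byte[][] compressed, int[] lengths, int i) {
        ByteArrayOutputStream dict = new ByteArrayOutputStream();
        for (int j = 0; j < i; j++) {
          dict.write(compressed[j], 0, compressed[j].length);
        }
        return lz4.decompress(dict.toByteArray(), compressed[i], lengths[i]);
      }
    }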

In both cases, I tried to implement it in such a way that it is not required to override the default bulk merge API in order to get good merging performance: the readers keep some state that allows them to read documents sequentially. This should also help for operations like index exports since they would get much better performance when iterating over documents in order.
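
Roughly, the state that the readers keep looks like this (names are illustrative, not the actual classes of the patch):

    // Illustration of the per-reader state: the last decoded block is kept around
    // so that in-order access (merging, index exports) does not decode the same
    // block once per document.
    class SequentialStoredFieldsReaderSketch {
      private byte[] block;            // decoded documents of the last visited block
      private int blockFirstDoc = 0;
      private int blockLastDoc = -1;   // empty initial state

      byte[] document(int docID) throws java.io.IOException {
        if (docID < blockFirstDoc || docID > blockLastDoc) {
          // Cache miss: locate and decode the block that contains docID.
          block = decodeBlockContaining(docID);
          blockFirstDoc = firstDocOfCurrentBlock();
          blockLastDoc = lastDocOfCurrentBlock();
        }
        // In-order access reuses the cached block for every document of the
        // block but the first one.
        return sliceDocument(block, docID - blockFirstDoc);
      }

      // Placeholders for the index/offset lookups of the real stored fields format.
      private byte[] decodeBlockContaining(int docID) throws java.io.IOException {
        throw new UnsupportedOperationException();
      }
      private int firstDocOfCurrentBlock() { throw new UnsupportedOperationException(); }
      private int lastDocOfCurrentBlock() { throw new UnsupportedOperationException(); }
      private byte[] sliceDocument(byte[] block, int indexInBlock) {
        throw new UnsupportedOperationException();
      }
    }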

The patch is not ready yet: it is too light on tests for me to sleep quietly, and quite inefficient on large documents. For now I'm just sharing it in order to get some feedback. :-)

For the shared dictionary logic, I looked at other approaches that didn't work out well:
 - trying to compute a shared dictionary happens to be very costly since you would need to compute the longest common subsequences that are shared across documents in a block. That is why I ended up using the compressed documents as a dictionary: it requires neither additional CPU nor space, while working quite efficiently due to the way that LZ4 works.
 - I tried to see what can be done with shared dictionaries and DEFLATE, but the dictionaries are only used for the LZ77 part of the algorithm, not for the Huffman coding, so it was not really helpful in our case (there is a small java.util.zip example after this list).
 - I tried to see if we could compute a Huffman dictionary per block and use it to compress all documents individually (something that the DEFLATE API doesn't allow for) but it was very slow (probably for two reasons: 1. I wanted to keep it simple and 2. Java is not as efficient as C for that kind of stuff).
 - I also played with the ng2 and ng3 algorithms described in http://openproceedings.org/EDBT/2014/paper_25.pdf but they were significantly slower than LZ4 (LZ4 can process large sequences of bytes at a time, while these formats only work on 2 or 3 bytes at a time).
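
For reference, this is the kind of preset-dictionary support I mean for DEFLATE; java.util.zip exposes it through Deflater.setDictionary (and Inflater.setDictionary on the read side), and as said above it only feeds the LZ77 stage:

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    class DeflatePresetDictSketch {
      // Compress a document with a preset dictionary; the same dictionary has to
      // be passed to Inflater.setDictionary when inflating (once needsDictionary()
      // returns true).
      static byte[] compress(byte[] dictionary, byte[] doc) {
        Deflater deflater = new Deflater();
        deflater.setDictionary(dictionary); // only helps the LZ77 stage, not the Huffman coding
        deflater.setInput(doc);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (deflater.finished() == false) {
          out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
      }
    }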

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 4.11
>
>         Attachments: LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have heard from about the same number of users that compression is too aggressive as that compression is too light.
> I think it is due to the fact that we have users who are doing very different things with Lucene. For example, if you have a small index that fits in the filesystem cache (or is close to fitting), then you might never pay for actual disk seeks, and in such a case the fact that the current stored fields format needs to decompress more data than what is actually requested can noticeably slow down cheap queries.
> On the other hand, it is more and more common to use Lucene for things like log analytics, and in that case you have huge amounts of data for which you don't care much about stored fields performance. However, it is very frustrating to notice that your raw data takes several times less space when you gzip it than it does in your index, although Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have some kind of option that would allow trading speed for compression in the default codec.


