Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2020/08/27 15:03:00 UTC

[jira] [Comment Edited] (LUCENE-9486) Explore using preset dictionaries with LZ4 for stored fields

    [ https://issues.apache.org/jira/browse/LUCENE-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185904#comment-17185904 ] 

Adrien Grand edited comment on LUCENE-9486 at 8/27/20, 3:02 PM:
----------------------------------------------------------------

I played with various configurations and settled on a 4kB preset dictionary combined with 10 sub-blocks of 60kB, which gives interesting results. Here are some benchmarks on the same datasets as LUCENE-9447:

On highly compressible JSON logs:

||Method||Index size (MB)||Index time (s)||Avg fetch time (us)||
|LZ4(16kB) (current BEST_SPEED)|304.2|9|5|
|LZ4(60kB)|141.7|7.5|10|
|LZ4(256kB)|105.1|7.5|33|
|LZ4(1MB)|96.5|7.5|115|
|LZ4 with preset dict (new BEST_SPEED)|91.9|7.5|16|
|Deflate with preset dict (new BEST_COMPRESSION)|64.9|14|41|

On enwiki documents:

||Method||Index size (MB)||Index time (s)||Avg fetch time (us)||
|LZ4(16kB) (current BEST_SPEED)|558.8|14.5|83|
|LZ4(60kB)|526.2|15|120|
|LZ4(256kB)|523.1|15|323|
|LZ4(1MB)|521.3|15.5|1151|
|LZ4 with preset dict (new BEST_SPEED)|515.2|15|135|
|Deflate with preset dict (new BEST_COMPRESSION)|338.0|35|250|

It makes fetch times a bit slower (5 to 16 us on the JSON logs, 83 to 135 us on enwiki), which I think is fair given that these fetch times are still well under the cost of a page fault. Indexing remains as fast as today, and compression gets respectively 3.3x and 8% better on these datasets.
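
For readers who want to check the 3.3x and 8% figures, here is a small sketch that recomputes them from the index sizes in the two tables (current BEST_SPEED versus LZ4 with preset dict). The helper names are illustrative only.

```python
# Index sizes in MB, copied from the benchmark tables in this comment.
JSON_LOGS_MB = {"current": 304.2, "preset_dict": 91.9}
ENWIKI_MB = {"current": 558.8, "preset_dict": 515.2}


def compression_gain(before_mb: float, after_mb: float) -> float:
    """How many times smaller the index became (e.g. 2.0 means half the size)."""
    return before_mb / after_mb


def size_reduction_pct(before_mb: float, after_mb: float) -> float:
    """Relative size reduction, as a percentage of the original size."""
    return 100.0 * (before_mb - after_mb) / before_mb


json_gain = compression_gain(JSON_LOGS_MB["current"], JSON_LOGS_MB["preset_dict"])
enwiki_pct = size_reduction_pct(ENWIKI_MB["current"], ENWIKI_MB["preset_dict"])
print(f"JSON logs: {json_gain:.1f}x smaller, enwiki: {enwiki_pct:.0f}% smaller")
```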

I also included the results with BEST_COMPRESSION in the above benchmarks to show the trade-off that users are making when going with one versus the other.
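
The layout described at the top of this comment (a 4kB preset dictionary shared by 60kB sub-blocks) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Lucene's implementation: it assumes the dictionary is simply the first 4kB of each block, and it uses Python's `zlib` (DEFLATE) as a stand-in compressor because the standard library has no LZ4 binding. The names `compress_block` and `read_range` are hypothetical.

```python
import zlib

DICT_SIZE = 4 * 1024        # 4kB preset dictionary, as in the comment
SUB_BLOCK_SIZE = 60 * 1024  # 60kB sub-blocks, as in the comment


def _compress(data: bytes, zdict: bytes = b"") -> bytes:
    # Prime the compressor with the preset dictionary when one is given.
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()


def _decompress(data: bytes, zdict: bytes = b"") -> bytes:
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(data) + d.flush()


def compress_block(block: bytes):
    """Split a block into a dictionary plus independently compressed sub-blocks.

    Assumption: the dictionary is the first DICT_SIZE bytes of the block.
    """
    dictionary, rest = block[:DICT_SIZE], block[DICT_SIZE:]
    subs = [
        _compress(rest[i:i + SUB_BLOCK_SIZE], dictionary)
        for i in range(0, len(rest), SUB_BLOCK_SIZE)
    ]
    return _compress(dictionary), subs


def read_range(compressed_dict: bytes, subs, offset: int, length: int) -> bytes:
    """Return block[offset:offset + length], touching only the needed sub-blocks."""
    dictionary = _decompress(compressed_dict)
    end = offset + length
    out = bytearray(dictionary[offset:end])  # empty if the range starts past the dict
    for k, sub in enumerate(subs):
        start = DICT_SIZE + k * SUB_BLOCK_SIZE
        if start >= end:
            break                      # this and all later sub-blocks are past the range
        if start + SUB_BLOCK_SIZE <= offset:
            continue                   # sub-block ends before the range: skip, no decompression
        data = _decompress(sub, dictionary)
        out += data[max(offset - start, 0):end - start]
    return bytes(out)


# Usage: fetch 500 bytes from the middle of a ~300kB block of repetitive log lines.
block = b'{"level":"INFO","msg":"request served","status":200}\n' * 6000
compressed_dict, subs = compress_block(block)
assert read_range(compressed_dict, subs, 100_000, 500) == block[100_000:100_500]
```

The point of the design is that a fetch only pays for decompressing the small dictionary plus the sub-block(s) containing the document, while each sub-block still benefits from the shared dictionary for its compression ratio.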



> Explore using preset dictionaries with LZ4 for stored fields
> ------------------------------------------------------------
>
>                 Key: LUCENE-9486
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9486
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Follow-up to LUCENE-9447: using preset dictionaries with DEFLATE provided very significant gains. Adding support for preset dictionaries with LZ4 would be easy, so let's give it a try.


