Posted to dev@lucene.apache.org by "Ryan Ernst (JIRA)" <ji...@apache.org> on 2014/12/08 22:49:15 UTC

[jira] [Commented] (LUCENE-6100) Further tuning of Lucene50Codec(BEST_COMPRESSION)

    [ https://issues.apache.org/jira/browse/LUCENE-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238509#comment-14238509 ] 

Ryan Ernst commented on LUCENE-6100:
------------------------------------

+1

> Further tuning of Lucene50Codec(BEST_COMPRESSION)
> -------------------------------------------------
>
>                 Key: LUCENE-6100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6100
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6100.patch
>
>
> Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But in the case of highly compressible data, the ratio for BEST_COMPRESSION is not much better than BEST_SPEED, because they share the same underlying format, which is not optimized for this case.
> The block size is currently 24576 bytes (the 32KB sliding window size minus an 8KB "grace" to avoid going over it). And we compress in a stateless manner: each block is its own stream, and blocks don't share a preset dictionary or anything else. So we have a lot of waste in many cases, since zlib has to reboot itself for every block, and then we generally throw away 1/4 of the window and start over.
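
In java.util.zip terms, that stateless scheme looks roughly like the sketch below (an illustration only, not the actual Lucene50 stored-fields code; the class and constant names here are made up):

    import java.util.Arrays;
    import java.util.zip.Deflater;

    // Illustrative only: compress one stored-fields block with DEFLATE at the
    // highest level, sharing no state with neighbouring blocks. The sizes match
    // the numbers above: a 32KB zlib window minus the 8KB "grace" = 24576 bytes.
    public class StatelessBlockCompressor {
      static final int WINDOW_SIZE = 32 * 1024;          // zlib sliding window
      static final int GRACE = 8 * 1024;                  // head-room to avoid going over the window
      static final int BLOCK_SIZE = WINDOW_SIZE - GRACE;  // 24576

      static byte[] compressBlock(byte[] data, int off, int len) {
        // A fresh Deflater per block is the "zlib has to reboot itself" part:
        // no preset dictionary and no history carried over from the previous block.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate, no zlib header
        try {
          deflater.setInput(data, off, len);
          deflater.finish();
          byte[] out = new byte[len + 64];
          int written = 0;
          while (!deflater.finished()) {
            if (written == out.length) {
              out = Arrays.copyOf(out, out.length * 2);
            }
            written += deflater.deflate(out, written, out.length - written);
          }
          return Arrays.copyOf(out, written);
        } finally {
          deflater.end();
        }
      }
    }
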
> I ran some experiments with highly compressible logs data:
> ||method||indexing time (ms)||merging time (ms)||fdt size (bytes)||fdx size (bytes)||
> |BEST_SPEED|101,729|15,638|372,845,282|406,964|
> |BEST_COMPRESSION|114,364|23,474|269,387,347|275,909|
> |patch (60KB)|105,533|18,914|237,284,342|117,639|
> The other experiments I ran were:
> ||method||indexing time (ms)||merging time (ms)||fdt size (bytes)||fdx size (bytes)||
> |crappy preset|130,854|38,095|234,603,971|274,500|
> |64KB|107,256|21,570|236,004,297|111,135|
> |crappy preset+64KB|121,503|30,030|222,422,924|110,751|
> For 'crappy preset' I just use the arbitrary first 32KB of the original data as a preset dictionary for every block. This is effective, but slow because of some unnecessary overhead (like computing the adler32 of the preset dict over and over for each block). However, this overhead is reduced with larger block sizes, and it still offers benefits, so maybe in the future we can do it (especially if it's per-chunk and we can bulk-merge chunks without recompressing, etc).
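
Extending the sketch above, the 'crappy preset' experiment could be approximated with a method like this (again purely illustrative; the actual patch may work differently): the same fixed 32KB slice is installed as a preset dictionary before each block is compressed, so back-references can point into the dictionary even at the very start of the block.

    // Illustrative only: reuse one fixed dictionary (e.g. the first 32KB of the
    // original data, as described above) for every block.
    static byte[] compressBlockWithPreset(byte[] presetDict, byte[] data, int off, int len) {
      Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true);
      try {
        // Re-installing the dictionary for every block is where the per-block
        // overhead described above comes from.
        deflater.setDictionary(presetDict, 0, Math.min(presetDict.length, 32 * 1024));
        deflater.setInput(data, off, len);
        deflater.finish();
        byte[] out = new byte[len + 64];
        int written = 0;
        while (!deflater.finished()) {
          if (written == out.length) {
            out = Arrays.copyOf(out, out.length * 2);
          }
          written += deflater.deflate(out, written, out.length - written);
        }
        return Arrays.copyOf(out, written);
      } finally {
        deflater.end();
      }
    }
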
> For 64KB, we measure removing the "grace" completely, so it spills into another block each time. The proposed smaller "grace" amount still offers CPU savings, so I think we should keep it. But it's not terrible if you go over.
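
To put the block sizes from the tables side by side (only 24576 bytes and the 60KB/64KB figures appear in the issue; expressing them in bytes is just arithmetic):

    static final int ZLIB_WINDOW      = 32 * 1024;           // 32768: DEFLATE's maximum history window
    static final int GRACE            = 8 * 1024;            // 8192: head-room to avoid going over the window
    static final int BLOCK_CURRENT    = ZLIB_WINDOW - GRACE; // 24576: the current block size
    static final int BLOCK_PATCH_60KB = 60 * 1024;           // 61440: the "patch (60KB)" row
    static final int BLOCK_NO_GRACE   = 64 * 1024;           // 65536: "grace" removed entirely
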



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org