Posted to jira@kafka.apache.org by "Rens Groothuijsen (Jira)" <ji...@apache.org> on 2020/05/13 23:50:00 UTC

[jira] [Assigned] (KAFKA-9716) Values of compression-rate and compression-rate-avg are misleading

     [ https://issues.apache.org/jira/browse/KAFKA-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rens Groothuijsen reassigned KAFKA-9716:
----------------------------------------

    Assignee: Rens Groothuijsen

> Values of compression-rate and compression-rate-avg are misleading
> ------------------------------------------------------------------
>
>                 Key: KAFKA-9716
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9716
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, compression
>    Affects Versions: 2.4.1
>            Reporter: Christian Kosmowski
>            Assignee: Rens Groothuijsen
>            Priority: Minor
>
> The values of the following metrics:
> compression-rate, compression-rate-avg, and basically every other compression-rate metric (e.g. the per-topic compression rate)
> are confusing.
> They are calculated as follows:
> {code:java}
> if (numRecords == 0L) {
>     buffer().position(initialPosition);
>     builtRecords = MemoryRecords.EMPTY;
> } else {
>     if (magic > RecordBatch.MAGIC_VALUE_V1)
>         this.actualCompressionRatio = (float) writeDefaultBatchHeader() / this.uncompressedRecordsSizeInBytes;
>     else if (compressionType != CompressionType.NONE)
>         this.actualCompressionRatio = (float) writeLegacyCompressedWrapperHeader() / this.uncompressedRecordsSizeInBytes;
>     ByteBuffer buffer = buffer().duplicate();
>     buffer.flip();
>     buffer.position(initialPosition);
>     builtRecords = MemoryRecords.readableRecords(buffer.slice());
> }
> {code}
> Basically, the compressed size is divided by the uncompressed size, which leads to a value < 1 for high compression (good if you want compression) or > 1 for poor compression (bad if you want compression).
> From the name "compression rate" I would expect the exact opposite. Apart from the fact that the word "rate" usually refers to a comparison of values with different units (e.g. miles per hour), the correct word, "ratio", would refer to the uncompressed size divided by the compressed size. (The code uses the correct word, but the metric names do not.)
> So if the compressed data takes half the space of the uncompressed data, the correct value for the compression ratio (or rate) would be 2, not 0.5 as Kafka reports it. That is really confusing, and I would AT LEAST expect this behaviour to be documented somewhere, but it is not: all documentation sources just say "the compression rate".
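
For illustration only, the arithmetic described in the report can be sketched as follows. The class and method names here are hypothetical and not part of Kafka; the sketch just contrasts the value computed by the quoted snippet (compressed size divided by uncompressed size) with the conventional ratio (uncompressed size divided by compressed size), using the report's example of data compressed to half its size.

```java
// Hypothetical demo class, not Kafka code: contrasts the two ways of
// expressing how well a batch compressed.
public class CompressionRatioDemo {

    // Value as computed in the quoted snippet: compressed / uncompressed.
    // Smaller is better; 0.5 means the data shrank to half its size.
    public static float reportedRate(long compressedBytes, long uncompressedBytes) {
        return (float) compressedBytes / uncompressedBytes;
    }

    // Conventional compression ratio: uncompressed / compressed.
    // Larger is better; 2.0 means the data shrank to half its size.
    public static float conventionalRatio(long compressedBytes, long uncompressedBytes) {
        return (float) uncompressedBytes / compressedBytes;
    }

    public static void main(String[] args) {
        long uncompressed = 1000L; // assumed example sizes, in bytes
        long compressed = 500L;

        System.out.println(reportedRate(compressed, uncompressed));      // prints 0.5
        System.out.println(conventionalRatio(compressed, uncompressed)); // prints 2.0
    }
}
```

The two values are reciprocals of each other, so a dashboard that expects the conventional ratio can presumably derive it as 1 divided by the reported metric, as long as the reported value is non-zero.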



--
This message was sent by Atlassian Jira
(v8.3.4#803005)