Posted to dev@kafka.apache.org by "Guozhang Wang (JIRA)" <ji...@apache.org> on 2016/05/17 18:53:13 UTC

[jira] [Comment Edited] (KAFKA-3704) Improve mechanism for compression stream block size selection in KafkaProducer

    [ https://issues.apache.org/jira/browse/KAFKA-3704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286887#comment-15286887 ] 

Guozhang Wang edited comment on KAFKA-3704 at 5/17/16 6:52 PM:
---------------------------------------------------------------

Thanks for the summary [~ijuma].

I think 2) solves the problem "cleanly" except for GZIP, while 3) still introduces extra memory outside of the controlled buffer pool: one block for each partition. 1) introduces a new config but does not necessarily bound the total extra memory allocated outside the buffer pool.

Personally I feel 3) is worth doing: originally I was concerned it would complicate the code quite a lot, but after checking it once again I feel it may not be that much worse than 2).
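To make option 2) concrete, the JDK's own `java.util.zip.Deflater` already follows the pattern of letting the caller supply the output buffer, which is what passing pool-managed buffers to the compression library would look like. A minimal sketch, assuming a hypothetical caller-owned `pooledBuffer` standing in for a buffer obtained from the producer's buffer pool (this is not Kafka's actual `BufferPool` API, and Snappy/LZ4 would need their own adaptations as the comment notes):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PooledCompressionSketch {
    public static void main(String[] args) throws Exception {
        byte[] input = "some repetitive payload payload payload".getBytes("UTF-8");

        // Hypothetical: in the real proposal this buffer would be checked out
        // of the producer's buffer pool, so total extra memory stays bounded.
        byte[] pooledBuffer = new byte[32 * 1024];

        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // deflate() writes into the caller-owned buffer instead of an
        // internally allocated one.
        int compressedLen = deflater.deflate(pooledBuffer);
        deflater.end();

        // Round-trip to verify the compressed bytes are intact.
        Inflater inflater = new Inflater();
        inflater.setInput(pooledBuffer, 0, compressedLen);
        byte[] out = new byte[input.length];
        int n = inflater.inflate(out);
        inflater.end();
        System.out.println(n == input.length);
    }
}
```

GZIP is the odd one out here because `GZIPOutputStream` allocates its (small, 512-byte default) buffer internally, which matches the comment's observation that 2) works "cleanly" except for GZIP.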



> Improve mechanism for compression stream block size selection in KafkaProducer
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-3704
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3704
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Assignee: Ismael Juma
>             Fix For: 0.10.1.0
>
>
> As discovered in https://issues.apache.org/jira/browse/KAFKA-3565, the current default block size (1K) used in Snappy and GZIP may cause a sub-optimal compression ratio for Snappy, and hence reduce throughput. Because we no longer recompress data in the broker, it also impacts what gets stored on disk.
> A solution might be to use the default block size, which is 64K in LZ4, 32K in Snappy and 0.5K in GZIP. The downside is that this solution will require more memory allocated outside of the buffer pool and hence users may need to bump up their JVM heap size, especially for MirrorMakers. Using Snappy as an example, it's an additional 2x32k per batch (as Snappy uses two buffers) and one would expect at least one batch per partition. However, the number of batches per partition can be much higher if the broker is slow to acknowledge producer requests (depending on `buffer.memory`, `batch.size`, message size, etc.).
> Given the above, there are a few things that could be done (potentially more than one):
> 1) A configuration for the producer compression stream buffer size.
> 2) Allocate buffers from the buffer pool and pass them to the compression library. This is possible with Snappy and we could adapt our LZ4 code. It's not possible with GZIP, but it uses a very small buffer by default.
> 3) Close the existing `RecordBatch.records` when we create a new `RecordBatch` for the `TopicPartition` instead of doing it during `RecordAccumulator.drain`. This would mean that we would only retain resources for one `RecordBatch` per partition, which would improve the worst case scenario significantly.
> Note that we decided that this change was too risky for 0.10.0.0 and reverted the original attempt.
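The memory impact described in the issue can be made concrete with back-of-the-envelope arithmetic. The 2x32 KiB-per-batch figure for Snappy is from the description above; the partition and batch counts below are made-up illustrative values, not measurements:

```java
public class CompressionOverheadEstimate {
    // Per the issue description: Snappy holds two 32 KiB buffers per batch.
    static final int SNAPPY_BLOCK_BYTES = 32 * 1024;
    static final int BUFFERS_PER_BATCH = 2;

    static long worstCaseExtraBytes(int partitions, int batchesPerPartition) {
        return (long) partitions * batchesPerPartition
                * BUFFERS_PER_BATCH * SNAPPY_BLOCK_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical MirrorMaker-like producer with 1000 partitions.
        // With option 3): only one open batch per partition retains buffers.
        System.out.println(
            worstCaseExtraBytes(1000, 1) / (1024 * 1024) + " MiB");  // 62 MiB
        // Slow broker: say 16 unacknowledged batches pile up per partition.
        System.out.println(
            worstCaseExtraBytes(1000, 16) / (1024 * 1024) + " MiB"); // 1000 MiB
    }
}
```

This is why option 3) improves the worst case so much: it caps the retained compression buffers at one batch per partition regardless of how many unacknowledged batches accumulate.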



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)