Posted to jira@kafka.apache.org by "Christo Lolov (Jira)" <ji...@apache.org> on 2023/02/06 10:50:00 UTC
[jira] [Commented] (KAFKA-14636) Compression optimization: Use zstd dictionary based (de)compression
[ https://issues.apache.org/jira/browse/KAFKA-14636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684587#comment-17684587 ]
Christo Lolov commented on KAFKA-14636:
---------------------------------------
I incorporated a dictionary and created a new JMH benchmark to test the performance of the implementation (https://github.com/apache/kafka/compare/trunk...clolov:kafka:produce-dictionary?expand=1). There were improvements, but they were substantial only on an artificial data set, as shown below:
{code:java}
# Without dictionary
Benchmark                                          (bufferSupplierStr)  (bytes)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt     Score
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             10                 2  thrpt    2  1046.463
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             50                 2  thrpt    2   957.770
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50            100                 2  thrpt    2   877.248
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             10                 2  thrpt    2   679.727
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             50                 2  thrpt    2   642.920
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100            100                 2  thrpt    2   569.959
{code}
{code:java}
# With dictionary
Benchmark                                          (bufferSupplierStr)  (bytes)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt     Score
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             10                 2  thrpt    2  1533.673
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             50                 2  thrpt    2  1376.801
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50            100                 2  thrpt    2  1209.928
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             10                 2  thrpt    2   878.464
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             50                 2  thrpt    2   790.505
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100            100                 2  thrpt    2   701.102
{code}
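For readers without the branch checked out, the mechanics being benchmarked can be sketched with a JDK-only example. Since zstd-jni is not part of the standard library, this sketch uses the analogous preset-dictionary support in java.util.zip (zlib); the class name, dictionary contents, and record contents are made up for illustration, not taken from the branch.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DictRoundTrip {

    // Compress 'data', optionally priming the compressor with a preset dictionary.
    static byte[] compress(byte[] data, byte[] dict) {
        Deflater deflater = new Deflater();
        if (dict != null) {
            deflater.setDictionary(dict);
        }
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int len = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, len);
    }

    // Decompress, supplying the dictionary when the inflater signals it needs one.
    static byte[] decompress(byte[] data, byte[] dict) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buf = new byte[4096];
        int len = inflater.inflate(buf);
        if (len == 0 && inflater.needsDictionary()) {
            inflater.setDictionary(dict);
            len = inflater.inflate(buf);
        }
        inflater.end();
        return Arrays.copyOf(buf, len);
    }

    public static void main(String[] args) throws DataFormatException {
        // Hypothetical record whose field names overlap with the dictionary.
        byte[] dict = "topic,partition,offset,timestamp".getBytes(StandardCharsets.UTF_8);
        byte[] msg = "topic=orders,partition=3,offset=42,timestamp=1675678200"
                .getBytes(StandardCharsets.UTF_8);

        byte[] withDict = compress(msg, dict);
        byte[] withoutDict = compress(msg, null);
        System.out.println("with dict: " + withDict.length
                + " bytes, without: " + withoutDict.length + " bytes");
        System.out.println(Arrays.equals(decompress(withDict, dict), msg));
    }
}
{code}
The gains come entirely from back-references into the dictionary, which is why they shrink on "realistic" data with less shared structure.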
On a more "realistic" data set, such as the one given in the link, the improvements were minimal. I experimented with different dictionary, sample, and buffer sizes, but could not obtain results similar to the ones detailed in https://github.com/facebook/zstd. I tried reaching out to people who had operational knowledge of Zstd, but none of the ones I spoke to had employed dictionaries.
[~ijuma], do you have any thoughts on whether to proceed with this, or any suggestions for improvement?
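For context on the "train the dictionary per topic" idea from the ticket, here is a deliberately naive sketch of that flow, again using the JDK's zlib preset-dictionary API rather than zstd-jni. zstd's real trainer (COVER) selects representative segments from many samples; this sketch merely reuses the tail of the first sampled bytes. All class, method, and topic names are hypothetical.
{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.Deflater;

// Naive per-topic dictionary "training": buffer the first sampled bytes of each
// topic, then use their tail as a preset dictionary for later compression.
public class PerTopicDictionary {
    // zlib only consults the most recent 32 KB of a preset dictionary.
    static final int MAX_DICT_BYTES = 32 * 1024;

    private final Map<String, ByteArrayOutputStream> samples = new HashMap<>();
    private final Map<String, byte[]> dictionaries = new HashMap<>();

    // Accumulate sample bytes until 'trainBytes' have been seen for the topic.
    void sample(String topic, byte[] message, int trainBytes) {
        if (dictionaries.containsKey(topic)) {
            return; // already trained for this topic
        }
        ByteArrayOutputStream buf =
                samples.computeIfAbsent(topic, t -> new ByteArrayOutputStream());
        buf.write(message, 0, message.length);
        if (buf.size() >= trainBytes) {
            byte[] all = buf.toByteArray();
            int from = Math.max(0, all.length - MAX_DICT_BYTES);
            dictionaries.put(topic, Arrays.copyOfRange(all, from, all.length));
            samples.remove(topic);
        }
    }

    // Compress with the topic's dictionary once trained; plain deflate before.
    byte[] compress(String topic, byte[] message) {
        Deflater deflater = new Deflater();
        byte[] dict = dictionaries.get(topic);
        if (dict != null) {
            deflater.setDictionary(dict);
        }
        deflater.setInput(message);
        deflater.finish();
        byte[] out = new byte[message.length + 64];
        int len = deflater.deflate(out);
        deflater.end();
        return Arrays.copyOf(out, len);
    }

    public static void main(String[] args) {
        PerTopicDictionary codec = new PerTopicDictionary();
        byte[] record = "user=alice,action=click,page=/checkout"
                .getBytes(StandardCharsets.UTF_8);
        byte[] before = codec.compress("clicks", record); // no dictionary yet
        codec.sample("clicks", record, 16);               // tiny threshold for demo
        byte[] after = codec.compress("clicks", record);  // dictionary now applied
        System.out.println(before.length + " -> " + after.length + " bytes");
    }
}
{code}
Any real implementation would also need to version and replicate the trained dictionary, since consumers must decompress with exactly the bytes the producer trained on.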
> Compression optimization: Use zstd dictionary based (de)compression
> -------------------------------------------------------------------
>
> Key: KAFKA-14636
> URL: https://issues.apache.org/jira/browse/KAFKA-14636
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Divij Vaidya
> Assignee: Christo Lolov
> Priority: Major
> Labels: needs-kip
>
> Use the dictionary functionality of Zstd (de)compression. Train the dictionary per topic on the first few MBs and then use it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)