Posted to jira@kafka.apache.org by "Christo Lolov (Jira)" <ji...@apache.org> on 2023/02/06 10:50:00 UTC

[jira] [Commented] (KAFKA-14636) Compression optimization: Use zstd dictionary based (de)compression

    [ https://issues.apache.org/jira/browse/KAFKA-14636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684587#comment-17684587 ] 

Christo Lolov commented on KAFKA-14636:
---------------------------------------

I added dictionary-based compression and created a new JMH benchmark to test the performance of the implementation (https://github.com/apache/kafka/compare/trunk...clolov:kafka:produce-dictionary?expand=1). There were improvements, but they were large only on an artificial data set, as seen below:
{code:java}
# Without dictionary
Benchmark                                          (bufferSupplierStr)  (bytes)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt     Score
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             10                 2  thrpt    2  1046.463
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             50                 2  thrpt    2   957.770
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50            100                 2  thrpt    2   877.248
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             10                 2  thrpt    2   679.727
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             50                 2  thrpt    2   642.920
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100            100                 2  thrpt    2   569.959
{code}

{code:java}
# With dictionary
Benchmark                                          (bufferSupplierStr)  (bytes)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt     Score
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             10                 2  thrpt    2  1533.673
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             50                 2  thrpt    2  1376.801
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50            100                 2  thrpt    2  1209.928
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             10                 2  thrpt    2   878.464
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             50                 2  thrpt    2   790.505
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100            100                 2  thrpt    2   701.102
{code}
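
The two tables are easier to compare as a per-row ratio. The sketch below is only arithmetic on the scores listed above; the class and method names are illustrative and not part of the linked branch. On these numbers the ratio ranges from roughly 1.23x to 1.47x.

```java
// Illustrative helper: relative throughput of the dictionary run vs. the baseline run.
// Scores are copied from the two benchmark tables above (bytes = ONES, messageVersion = 2).
class DictionarySpeedup {

    // Ratio > 1.0 means the dictionary run scored higher than the baseline.
    static double speedup(double baseline, double withDictionary) {
        return withDictionary / baseline;
    }

    public static void main(String[] args) {
        // {maxBatchSize, messageSize} for each row, in table order.
        int[][] params = {
            {50, 10}, {50, 50}, {50, 100},
            {100, 10}, {100, 50}, {100, 100}
        };
        double[] withoutDict = {1046.463, 957.770, 877.248, 679.727, 642.920, 569.959};
        double[] withDict = {1533.673, 1376.801, 1209.928, 878.464, 790.505, 701.102};

        for (int i = 0; i < params.length; i++) {
            System.out.printf("maxBatchSize=%d messageSize=%d speedup=%.2fx%n",
                    params[i][0], params[i][1], speedup(withoutDict[i], withDict[i]));
        }
    }
}
```

The gain shrinks as maxBatchSize and messageSize grow, which is consistent with a dictionary mattering most when there is little data per batch to learn from.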
On a more "realistic" data set, such as the one given in the link, the improvements were minimal. I experimented with different dictionary, sample, and buffer sizes, but could not obtain results similar to the ones detailed in https://github.com/facebook/zstd. I tried reaching out to people with operational knowledge of Zstd, but none of the people I spoke to had used dictionaries.
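
For readers unfamiliar with the general idea, the sketch below shows preset-dictionary compression on a single small message. It is not the implementation in the linked branch and it does not use Zstd: to stay runnable on a plain JDK it swaps in `java.util.zip.Deflater.setDictionary`, which applies the same principle (seeding the compressor with bytes that resemble the payload). The class name, the sample messages, and the hand-written dictionary are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Stand-in illustration using the JDK's zlib bindings, NOT Zstd.
class PresetDictionaryDemo {

    // Compresses input; if dictionary is non-null it is preloaded into the compressor.
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater();
        if (dictionary != null) {
            deflater.setDictionary(dictionary);
        }
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses data that was produced by compress() with the same dictionary.
    static byte[] decompress(byte[] compressed, byte[] dictionary, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int n = inflater.inflate(out);
            // zlib signals that a preset dictionary is required before any output is produced.
            if (n == 0 && inflater.needsDictionary()) {
                inflater.setDictionary(dictionary);
                n = inflater.inflate(out);
            }
            return Arrays.copyOf(out, n);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt input or wrong dictionary", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: two earlier messages with the same shape as the payload.
        byte[] dictionary = ("{\"userId\":10001,\"event\":\"view\",\"page\":\"/home\"}"
                + "{\"userId\":10002,\"event\":\"click\",\"page\":\"/cart\"}")
                .getBytes(StandardCharsets.UTF_8);
        byte[] message = "{\"userId\":12345,\"event\":\"click\",\"page\":\"/home\"}"
                .getBytes(StandardCharsets.UTF_8);

        byte[] plain = compress(message, null);
        byte[] seeded = compress(message, dictionary);
        byte[] restored = decompress(seeded, dictionary, message.length);

        System.out.println("raw bytes:             " + message.length);
        System.out.println("compressed, no dict:   " + plain.length);
        System.out.println("compressed, with dict: " + seeded.length);
        System.out.println("round trip ok:         " + Arrays.equals(message, restored));
    }
}
```

The benefit depends on how closely the dictionary resembles the payload, and it fades as batches grow large enough for the compressor to find the same redundancy on its own. That may be one reason a realistic data set shows a smaller gain than a synthetic one.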

[~ijuma], do you have any thoughts on whether to proceed with this, or any suggestions for improvement?

> Compression optimization: Use zstd dictionary based (de)compression
> -------------------------------------------------------------------
>
>                 Key: KAFKA-14636
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14636
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Divij Vaidya
>            Assignee: Christo Lolov
>            Priority: Major
>              Labels: needs-kip
>
> Use the dictionary functionality of Zstd decompression. Train the dictionary per topic on the first few MBs and then use it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)