Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2012/09/24 22:40:07 UTC
[jira] [Updated] (KAFKA-527) Compression support does numerous byte copies

     [ https://issues.apache.org/jira/browse/KAFKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Kreps updated KAFKA-527:
----------------------------

    Description: 
The data path for compressing or decompressing messages is extremely inefficient. We do something like 7 (?) complete copies of the data, often for simple things like adding a 4-byte size to the front. I am not sure how this went by unnoticed.

This is likely the root cause of the performance issues we saw in doing bulk recompression of data in mirror maker.

The mismatch between the InputStream and OutputStream interfaces and the Message/MessageSet interfaces, which are based on byte buffers, is the cause of many of these copies.

I believe the right thing to do is to rework the compression code so that it doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I understand correctly, GZIP = DEFLATE + HEADER + FOOTER. The header contains things like magic bytes and the compression method, and the footer contains a CRC-32 checksum and the uncompressed length. Since we already record the compression type, using GZIP is redundant, and we should just be using DEFLATE, which has direct support for byte arrays. With this change I think it should be possible to eliminate all copying from the compression path in the common case.
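
To illustrate the proposal, here is a minimal Java sketch (not Kafka's actual code) of raw DEFLATE on byte arrays using java.util.zip.Deflater and Inflater with nowrap = true, which omits the wrapper header and checksum. The compressed bytes are written directly after a reserved 4-byte size prefix, so the prefix is filled in afterwards rather than added by copying. The class name RawDeflateSketch, the method names, the big-endian prefix layout, and the assumption that the reader already knows the uncompressed length are all hypothetical choices made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: raw DEFLATE into a size-prefixed buffer without a stream.
final class RawDeflateSketch {
    static final int SIZE_PREFIX = 4; // assumed layout: 4-byte big-endian length

    // Compresses input directly into one array, leaving room for the size prefix.
    static ByteBuffer compress(byte[] input) {
        // nowrap = true: raw DEFLATE, no zlib/GZIP header or checksum.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try {
            deflater.setInput(input);
            deflater.finish();
            // Initial guess for the output size; grown only if it proves too small.
            byte[] out = new byte[SIZE_PREFIX + input.length + 64];
            int written = SIZE_PREFIX;
            while (!deflater.finished()) {
                if (written == out.length) {
                    out = Arrays.copyOf(out, out.length * 2); // rare fallback copy
                }
                written += deflater.deflate(out, written, out.length - written);
            }
            ByteBuffer framed = ByteBuffer.wrap(out, 0, written);
            // Fill in the reserved prefix in place instead of prepending by copy.
            framed.putInt(0, written - SIZE_PREFIX);
            return framed;
        } finally {
            deflater.end();
        }
    }

    // Decompresses a buffer produced by compress(); the caller supplies the
    // uncompressed length (an assumption of this sketch).
    static byte[] decompress(ByteBuffer framed, int originalLength) {
        int compressedLength = framed.getInt(framed.position());
        Inflater inflater = new Inflater(true); // must match nowrap = true
        try {
            // Read straight from the backing array; no intermediate stream.
            inflater.setInput(framed.array(),
                    framed.arrayOffset() + framed.position() + SIZE_PREFIX,
                    compressedLength);
            byte[] out = new byte[originalLength];
            int read = 0;
            while (read < out.length) {
                int n = inflater.inflate(out, read, out.length - read);
                if (n == 0 && (inflater.finished() || inflater.needsInput()
                        || inflater.needsDictionary())) {
                    break; // no further progress is possible
                }
                read += n;
            }
            if (read != out.length) {
                throw new IllegalStateException("unexpected uncompressed length");
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt DEFLATE data", e);
        } finally {
            inflater.end();
        }
    }
}
```

Calling RawDeflateSketch.compress(bytes) returns a buffer whose first four bytes hold the compressed length, and RawDeflateSketch.decompress(buffer, bytes.length) reverses it. The sketch requires an array-backed ByteBuffer; a direct buffer would need a different input path.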



  was:
The data path for compressing or decompressing messages is extremely inefficient. We do something like 7 (?) complete copies of the data, often for simple things like adding a 4-byte size to the front. I am not how this went by unnoticed.

This is likely the root cause of the performance issues we saw in doing bulk recompression of data in mirror maker.

The mismatch between the InputStream and OutputStream interfaces and the Message/MessageSet interfaces, which are based on byte buffers, is the cause of many of these copies.

I believe the right thing to do is to rework the compression code so that it doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I understand correctly, GZIP = DEFLATE + HEADER + FOOTER. The header contains things like magic bytes and the compression method, and the footer contains a CRC-32 checksum and the uncompressed length. Since we already record the compression type, using GZIP is redundant, and we should just be using DEFLATE, which has direct support for byte arrays. With this change I think it should be possible to eliminate all copying from the compression path in the common case.



    
> Compression support does numerous byte copies
> ---------------------------------------------
>
>                 Key: KAFKA-527
>                 URL: https://issues.apache.org/jira/browse/KAFKA-527
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jay Kreps
>
> The data path for compressing or decompressing messages is extremely inefficient. We do something like 7 (?) complete copies of the data, often for simple things like adding a 4-byte size to the front. I am not sure how this went by unnoticed.
> This is likely the root cause of the performance issues we saw in doing bulk recompression of data in mirror maker.
> The mismatch between the InputStream and OutputStream interfaces and the Message/MessageSet interfaces, which are based on byte buffers, is the cause of many of these copies.
> I believe the right thing to do is to rework the compression code so that it doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I understand correctly, GZIP = DEFLATE + HEADER + FOOTER. The header contains things like magic bytes and the compression method, and the footer contains a CRC-32 checksum and the uncompressed length. Since we already record the compression type, using GZIP is redundant, and we should just be using DEFLATE, which has direct support for byte arrays. With this change I think it should be possible to eliminate all copying from the compression path in the common case.
