You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "lwpyr (via GitHub)" <gi...@apache.org> on 2023/06/13 18:21:26 UTC

[GitHub] [arrow-rs] lwpyr opened a new pull request, #4411: Fix bug in compression

lwpyr opened a new pull request, #4411:
URL: https://github.com/apache/arrow-rs/pull/4411

   The wrongly calculated compressed length included the full original buffer length, which will decline almost all the compressable data. Suppose original buffer len is *a*, incoming data len is *b*, compressed data len is *c*, the code should compare *b* and *c* instead of *b* and *a+c*
   
   # Which issue does this PR close?
   Closes #4410.
   
   # Rationale for this change
    We observe abnormal compress ratio in our production code even with ZSTD.
   
   # What changes are included in this PR?
   Fix the compressed length calculation
   
   # Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb merged pull request #4411: Fix bug in IPC logic that determines if the buffer should be compressed or not

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4411: Fix bug in compression

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411#discussion_r1232460412


##########
arrow-ipc/src/compression.rs:
##########
@@ -69,7 +69,7 @@ impl CompressionCodec {
             output.extend_from_slice(&uncompressed_data_len.to_le_bytes());
             self.compress(input, output)?;
 
-            let compression_len = output.len();
+            let compression_len = output.len() - original_output_len;

Review Comment:
   > The wrongly calculated compressed length included the full original buffer length, which will decline almost all the compressable data. Suppose original buffer len is a, incoming data len is b, compressed data len is c, the code should compare b and c instead of b and a+c
   
   I agree this makes much more sense. I carefully reviewed this code and it looks good to me. THank you @lwpyr 
   
   It would be really nice if you could figure out some way to test it (so that we don't break it inadvertently in the future), but given this feature has no explicitly coverage (given the fact that there are no tests that need to change) I think this OR is ok
   
   cc @liukun4515 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] lwpyr commented on pull request #4411: Fix bug in compression

Posted by "lwpyr (via GitHub)" <gi...@apache.org>.
lwpyr commented on PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411#issuecomment-1590955488

   For compression, a perf case with monitor may be a better practice, but that is far beyond this bug fixing. I suggest to have a backlog work item on that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #4411: Fix bug in compression

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411#issuecomment-1589862663

   Thank you for this, could we possibly get a test so that this doesn't get broken in future?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] lwpyr commented on pull request #4411: Fix bug in compression

Posted by "lwpyr (via GitHub)" <gi...@apache.org>.
lwpyr commented on PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411#issuecomment-1590948147

   > Perhaps we can have a test that writes data with and without compression enabled and checks that the compressed version is smaller?
   
   The problem is the buggy code can also produce smaller ones... but the compress ratio is like 95%


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #4411: Fix bug in compression

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411#issuecomment-1590829763

   Perhaps we can have a test that writes data with and without compression enabled and checks that the compressed version is smaller?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] lwpyr commented on pull request #4411: Fix bug in compression

Posted by "lwpyr (via GitHub)" <gi...@apache.org>.
lwpyr commented on PR #4411:
URL: https://github.com/apache/arrow-rs/pull/4411#issuecomment-1590272914

   > Thank you for this, could we possibly get a test so that this doesn't get broken in future?
   
   It looks like not so explicit how to test this, there is no correctness issue, maybe in the future there could be a perf test suite to examine the compress ratio.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org