Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/18 16:46:27 UTC

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1855: support compression for IPC

liukun4515 commented on code in PR #1855:
URL: https://github.com/apache/arrow-rs/pull/1855#discussion_r900994846


##########
arrow/src/ipc/writer.rs:
##########
@@ -922,19 +1021,67 @@ fn write_array_data(
 }
 
 /// Write a buffer to a vector of bytes, and add its ipc::Buffer to a vector
+/// From https://github.com/apache/arrow/blob/6a936c4ff5007045e86f65f1a6b6c3c955ad5103/format/Message.fbs#L58
+/// Each constituent buffer is first compressed with the indicated
+/// compressor, and then written with the uncompressed length in the first 8
+/// bytes as a 64-bit little-endian signed integer followed by the compressed
+/// buffer bytes (and then padding as required by the protocol). The
+/// uncompressed length may be set to -1 to indicate that the data that
+/// follows is not compressed, which can be useful for cases where
+/// compression does not yield appreciable savings.
 fn write_buffer(
     buffer: &Buffer,
     buffers: &mut Vec<ipc::Buffer>,
     arrow_data: &mut Vec<u8>,
     offset: i64,
+    compression_codec: &CompressionCodecType,
 ) -> i64 {
-    let len = buffer.len();
-    let pad_len = pad_to_8(len as u32);
-    let total_len: i64 = (len + pad_len) as i64;
+    let origin_buffer_len = buffer.len();
+    let mut compression_buffer = Vec::<u8>::new();
+    let (data, uncompression_buffer_len) = match compression_codec {
+        CompressionCodecType::NoCompression => {
+            // this buffer_len will not used in the following logic
+            // If we don't use the compression, just write the data in the array
+            (buffer.as_slice(), origin_buffer_len as i64)
+        }
+        CompressionCodecType::Lz4Frame | CompressionCodecType::ZSTD => {
+            if (origin_buffer_len as i64) == LENGTH_EMPTY_COMPRESSED_DATA {
+                (buffer.as_slice(), 0)
+            } else {
+                compression_codec
+                    .compress(buffer.as_slice(), &mut compression_buffer)
+                    .unwrap();
+                if compression_buffer.len() > origin_buffer_len {
+                    // the length of compressed data is larger than uncompressed data
+                    // use the uncompressed data with -1
+                    // -1 indicate that we don't compress the data
+                    (buffer.as_slice(), LENGTH_NO_COMPRESSED_DATA)
+                } else {
+                    // use the compressed data with uncompressed length
+                    (compression_buffer.as_slice(), origin_buffer_len as i64)
+                }
+            }
+        }
+    };
+    let len = data.len() as i64;
+    // TODO: don't need to pad each buffer, and just need to pad the tail of the message body
+    // let pad_len = pad_to_8(len as u32);

Review Comment:
   I read the IPC format [spec](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc), and my understanding is that we don't need to align each buffer by padding `0` bytes at its tail; we only need to pad `0` bytes at the tail of the message body.
   @martin-g @alamb 
   Is my understanding of the IPC format wrong anywhere?
   Please feel free to point out any mistakes.
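
To make the question concrete, here is a minimal sketch (not the code in this PR) of the two things under discussion: the length-prefixed framing described in the doc comment above, and the two candidate padding strategies. The names `frame_buffer`, `layout_pad_each`, and `layout_pad_tail` are hypothetical, and the `compressed` argument stands in for the output of whatever codec is used; no real compression is performed here.

```rust
/// Sentinel described in Message.fbs: the bytes that follow are not compressed.
const LENGTH_NO_COMPRESSED_DATA: i64 = -1;

/// Number of zero bytes needed to round `len` up to a multiple of 8.
fn pad_to_8(len: usize) -> usize {
    (8 - len % 8) % 8
}

/// Frame one buffer: an 8-byte little-endian signed length prefix, then the payload.
/// `compressed` is assumed to be the codec output for `raw` (hypothetical input).
fn frame_buffer(raw: &[u8], compressed: &[u8]) -> Vec<u8> {
    let (prefix, payload) = if compressed.len() > raw.len() {
        // Compression did not help: store the raw bytes and flag them with -1.
        (LENGTH_NO_COMPRESSED_DATA, raw)
    } else {
        // Store the compressed bytes, prefixed with the uncompressed length.
        (raw.len() as i64, compressed)
    };
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&prefix.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Strategy A: pad every buffer to a multiple of 8 bytes.
/// Returns the (offset, length) of each buffer and the resulting body.
fn layout_pad_each(buffers: &[Vec<u8>]) -> (Vec<(usize, usize)>, Vec<u8>) {
    let mut body = Vec::new();
    let mut spans = Vec::new();
    for b in buffers {
        spans.push((body.len(), b.len()));
        body.extend_from_slice(b);
        let padded_len = body.len() + pad_to_8(b.len());
        body.resize(padded_len, 0);
    }
    (spans, body)
}

/// Strategy B: write buffers back to back and pad only the tail of the body.
fn layout_pad_tail(buffers: &[Vec<u8>]) -> (Vec<(usize, usize)>, Vec<u8>) {
    let mut body = Vec::new();
    let mut spans = Vec::new();
    for b in buffers {
        spans.push((body.len(), b.len()));
        body.extend_from_slice(b);
    }
    let padded_len = body.len() + pad_to_8(body.len());
    body.resize(padded_len, 0);
    (spans, body)
}
```

For two buffers of 11 and 3 bytes, strategy A yields offsets 0 and 16 with a 24-byte body, while strategy B yields offsets 0 and 11 with a 16-byte body. The second buffer therefore starts at an offset that is not a multiple of 8 under strategy B, which is the practical difference a reader of the file would observe.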


