You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/08 15:22:10 UTC

[GitHub] [arrow] nevi-me opened a new pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

nevi-me opened a new pull request #9137:
URL: https://github.com/apache/arrow/pull/9137


   Adds Recordbatch body compression, which compresses the buffers that make up arrays (e.g. offsets, null buffer).
   I've restricted the write side to only work with v5 of the metadata. We can expand on this later, as I think the non-legacy v4 supports the `BodyCompression` method implemented here. Reading should be fine if the compression info is specified.
   
   This PR is built on top of ARROW-10299 (#9122).
   
   I have not yet implemented ZSTD compression, but I expect it shouldn't be too much work, so can still be done as part of this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756815481






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format.
   
   ```
     // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
     // thereof. Not to be confused with "raw" (also called "block") format
     // provided by lz4.h
   ```
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4 frame example c code] https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/examples/frameCompress.c
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs
   - [c++ uses LZ4F_xxx APIs] https://github.com/apache/arrow/blob/3694794bdfd0677b95b8c95681e392512f1c9237/cpp/src/arrow/util/compression_lz4.cc


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756822180


   https://issues.apache.org/jira/browse/ARROW-8676


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757098021


   @nevi-me would you please have a look at  `BodyCompressionBuilder` from  [ipc/gen/Message.rs](https://github.com/apache/arrow/blob/a2e7d3a87fb8fa1cc98a54029c0262df468838fa/rust/arrow/src/ipc/gen/Message.rs)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757099760


   > @nevi-me would you please have a look at `BodyCompressionBuilder` from [ipc/gen/Message.rs](https://github.com/apache/arrow/blob/a2e7d3a87fb8fa1cc98a54029c0262df468838fa/rust/arrow/src/ipc/gen/Message.rs)?
   
   Did you notice something with it? I might need a bit more context :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on a change in pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on a change in pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#discussion_r554081674



##########
File path: rust/arrow/src/ipc/compression.rs
##########
@@ -0,0 +1,122 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! IPC Compression Utilities
+
+use std::io::{Read, Write};
+
+use crate::error::{ArrowError, Result};
+use crate::ipc::gen::Message::CompressionType;
+
+pub trait IpcCompressionCodec {
+    fn compress(&self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<()>;
+    fn decompress(&self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<usize>;
+    fn get_compression_type(&self) -> CompressionType;
+}
+
+pub type Codec = Box<dyn IpcCompressionCodec>;
+
+#[inline]
+pub(crate) fn get_codec(
+    compression_type: Option<CompressionType>,
+) -> Result<Option<Codec>> {
+    match compression_type {
+        Some(CompressionType::LZ4_FRAME) => Ok(Some(Box::new(Lz4CompressionCodec {}))),
+        Some(ctype) => Err(ArrowError::InvalidArgumentError(format!(
+            "IPC CompresstionType {:?} not yet supported",
+            ctype
+        ))),
+        None => Ok(None),
+    }
+}
+
+pub struct Lz4CompressionCodec {}
+const LZ4_BUFFER_SIZE: usize = 4096;
+
+impl IpcCompressionCodec for Lz4CompressionCodec {
+    fn compress(&self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<()> {
+        if input_buf.is_empty() {
+            output_buf.write_all(&8i64.to_le_bytes())?;
+            return Ok(());
+        }
+        // write out the uncompressed length as a LE i64 value
+        output_buf.write_all(&(input_buf.len() as i64).to_le_bytes())?;
+        let mut encoder = lz4::EncoderBuilder::new().build(output_buf)?;
+        let mut from = 0;
+        loop {

Review comment:
       This loop can be omitted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format, but it seems you are using the block APIs. 
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format.
   
   ```
     // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
     // thereof. Not to be confused with "raw" (also called "block") format
     // provided by lz4.h
   ```
   
   Just let you know this because I'm not able to explore this in a short period of time.
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #9137:
URL: https://github.com/apache/arrow/pull/9137


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #9137:
URL: https://github.com/apache/arrow/pull/9137


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756821780






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format.
   
   ```
     // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
     // thereof. Not to be confused with "raw" (also called "block") format
     // provided by lz4.h
   ```
   
   It's a bit complicate to implement with c bindings as what `frameCompress.c` does. Also found another crate named `**lzzzz**` for lz4 frame:
   https://docs.rs/lzzzz/0.8.0/lzzzz/lz4f/index.html, where both `compress_to_vec` and `decompress_to_vec` looks quite simple, I had made the unit test passed with these APIs. 
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4 frame example c code] https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/examples/frameCompress.c
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs
   - [c++ uses LZ4F_xxx APIs] https://github.com/apache/arrow/blob/3694794bdfd0677b95b8c95681e392512f1c9237/cpp/src/arrow/util/compression_lz4.cc
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757107707


   > Did you notice something with it? I might need a bit more context :)
   
   ```
   pub struct BodyCompressionArgs {
       pub codec: CompressionType,
       pub method: BodyCompressionMethod,
   }
   ```
   Perhaps you missed `RecordBatchBuilder::add_compression()`?
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-767020365


   I'll come back to this in the coming weeks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756821780


   The  error can be found at https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc line 283-285: 
   
   ```
   if (input_len != 0) {
         return Status::IOError("Lz4 compressed input contains more than one frame");
   }
   ```
   Seems that  input_len remains non-zero while the decompression completed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format.
   
   ```
     // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
     // thereof. Not to be confused with "raw" (also called "block") format
     // provided by lz4.h
   ```
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4 frame example c code] https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/examples/frameCompress.c
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs
   - [c++ uses LZ4F_xxx APIs] https://github.com/apache/arrow/blob/3694794bdfd0677b95b8c95681e392512f1c9237/cpp/src/arrow/util/compression_lz4.cc
   
   --update--
   
   It's a bit complicate to implement with c bindings as what `frameCompress.c` does. Also found another crate for lz4 frame:
   https://docs.rs/lzzzz/0.8.0/lzzzz/lz4f/index.html, where both `compress_to_vec` and `decompress_to_vec` looks good, you can evaluate from the corresponding examples.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757113891


   > Perhaps you missed `RecordBatchBuilder::add_compression()`?
   
   I missed it at the dictionary write, but not when writing a plain recordbatch; so it's not that. If I hadn't written the compression details completely, the C++ implementation wouldn't have known that the message is compressed or with LZ4.
   I'll debug when I have sufficient time


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-758010516


   C++ has flags for codec. Perhaps we should also add feature for this, because:
   1. Compression may not be a very common feature, I saw https://github.com/apache/arrow/pull/9139 also has similar discussions.
   2. lz4 library has unsafe codes that depend on liblz4


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on a change in pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on a change in pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#discussion_r554081674



##########
File path: rust/arrow/src/ipc/compression.rs
##########
@@ -0,0 +1,122 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! IPC Compression Utilities
+
+use std::io::{Read, Write};
+
+use crate::error::{ArrowError, Result};
+use crate::ipc::gen::Message::CompressionType;
+
+pub trait IpcCompressionCodec {
+    fn compress(&self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<()>;
+    fn decompress(&self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<usize>;
+    fn get_compression_type(&self) -> CompressionType;
+}
+
+pub type Codec = Box<dyn IpcCompressionCodec>;
+
+#[inline]
+pub(crate) fn get_codec(
+    compression_type: Option<CompressionType>,
+) -> Result<Option<Codec>> {
+    match compression_type {
+        Some(CompressionType::LZ4_FRAME) => Ok(Some(Box::new(Lz4CompressionCodec {}))),
+        Some(ctype) => Err(ArrowError::InvalidArgumentError(format!(
+            "IPC CompresstionType {:?} not yet supported",
+            ctype
+        ))),
+        None => Ok(None),
+    }
+}
+
+pub struct Lz4CompressionCodec {}
+const LZ4_BUFFER_SIZE: usize = 4096;
+
+impl IpcCompressionCodec for Lz4CompressionCodec {
+    fn compress(&self, input_buf: &[u8], output_buf: &mut Vec<u8>) -> Result<()> {
+        if input_buf.is_empty() {
+            output_buf.write_all(&8i64.to_le_bytes())?;
+            return Ok(());
+        }
+        // write out the uncompressed length as a LE i64 value
+        output_buf.write_all(&(input_buf.len() as i64).to_le_bytes())?;
+        let mut encoder = lz4::EncoderBuilder::new().build(output_buf)?;
+        let mut from = 0;
+        loop {

Review comment:
       This loop can be omitted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757438089


   > > Perhaps you missed `RecordBatchBuilder::add_compression()`?
   > 
   > I missed it at the dictionary write, but not when writing a plain recordbatch; so it's not that. If I hadn't written the compression details completely, the C++ implementation wouldn't have known that the message is compressed or with LZ4.
   > I'll debug when I have sufficient time
   
   @nevi-me I pulled your branch locally, found `add_compression` in `writer.rs`. I'm sorry that I failed to search `add_compression` from `https://github.com/apache/arrow/pull/9137/fies` because `writer.rs` is not loaded.
   
   I'm installing `pyarrow`,  after run test `test_write_file_v5_compressed`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756822180


   https://issues.apache.org/jira/browse/ARROW-8676


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on a change in pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy commented on a change in pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#discussion_r555029409



##########
File path: rust/arrow/src/ipc/writer.rs
##########
@@ -712,15 +762,24 @@ fn write_buffer(
     buffers: &mut Vec<ipc::Buffer>,
     arrow_data: &mut Vec<u8>,
     offset: i64,
-) -> i64 {
-    let len = buffer.len();
+    codec: Option<&Codec>,
+) -> Result<i64> {
+    let mut compressed = Vec::new();
+    let buf_slice = match codec {
+        Some(codec) => {
+            codec.compress(buffer.as_slice(), &mut compressed)?;
+            compressed.as_slice()
+        }
+        None => buffer.as_slice(),
+    };
+    let len = buf_slice.len();
     let pad_len = pad_to_8(len as u32);
     let total_len: i64 = (len + pad_len) as i64;
     // assert_eq!(len % 8, 0, "Buffer width not a multiple of 8 bytes");
     buffers.push(ipc::Buffer::new(offset, total_len));

Review comment:
       @nevi-me perhaps the arg `total_len` of `buffers.push(ipc::Buffer::new(...)` should not account for `pad_len`. After fixing this possible bug, my test on pyarrow passed and no failure with `cargo test`, but I'm doubting if this is the root cause, because:
   - I changed quite a lot based on your PR.
   - This line was last updated in PR ARROW-518 a year ago.
   
   I took quite some time to compare with the C++ code, it looks c++ set buffer size as actual memory size.
   The [IPC spec recordbatch-message](https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#recordbatch-message) states that:
   
   `
   The size field of Buffer is not required to account for padding bytes. Since this metadata can be used to communicate in-memory pointer addresses between libraries, it is recommended to set size to the actual memory size rather than the padded size.
   `
   
   FYI: https://github.com/apache/arrow/commit/68a558be866927435abffb33ccb606aa6f1c9390 at https://github.com/apache/arrow/compare/master...mqy:nevi-me_ARROW-8676 based on your PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756815481


   For anyone who understands how LZ4 works, I need some help.
   I'm able to read LZ4 compressed data written in Python from Rust, but I get an error when trying to read a file written by the Rust writer from `pyarrow`.
   
   The error that I'm getting is:
   
   ```
   OSError: Lz4 compressed input contains more than one frame
   ```
   
   To reproduce the error, please:
   1. Run the `arrow` unit tests, so that a compressed file is created.
   2. Run the below to read the file in `pyarrow`
   
   ```python
   import pyarrow as pa
   file = pa.ipc.open_file("$ARROW_ROOT/rust/arrow/target/debug/testdata/primitive_lz4.arrow_file")
   batches = file.read_all()
   ```
   
   Thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-756827284


   > Seems that input_len remains non-zero while the decompression completed.
   
   the logic of how the buffers are compressed isn't explained in a beginner-friendly way, so I likely am implementing the compression incorrectly. I used the Java implementation in #8949 as a reference. I couldn't quite understand the C++ implementation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-767020365


   I'll come back to this in the coming weeks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format.
   
   ```
     // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
     // thereof. Not to be confused with "raw" (also called "block") format
     // provided by lz4.h
   ```
   
   It's a bit complicate to implement with c bindings as what `frameCompress.c` does. Also found another crate named **`lzzzz`** for lz4 frame:
   https://docs.rs/lzzzz/0.8.0/lzzzz/lz4f/index.html, where both `compress_to_vec` and `decompress_to_vec` looks quite simple, I had made the unit test passed with these APIs. 
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4 frame example c code] https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/examples/frameCompress.c
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs
   - [c++ uses LZ4F_xxx APIs] https://github.com/apache/arrow/blob/3694794bdfd0677b95b8c95681e392512f1c9237/cpp/src/arrow/util/compression_lz4.cc
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9137: ARROW-8676: [Rust] IPC RecordBatch body compression

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9137:
URL: https://github.com/apache/arrow/pull/9137#issuecomment-757449958


   @nevi-me no luck to install pyarrow 2.0.0 due to various dependency errors, but found something:
   `Message.rs` requires `LZ4 frame format`, not the block format.
   
   ```
     // LZ4 frame format, for portability, as provided by lz4frame.h or wrappers
     // thereof. Not to be confused with "raw" (also called "block") format
     // provided by lz4.h
   ```
   
   Just let you know this because I'm not able to explore this in a short period of time.
   
   FYI:
   
   - [lz4frame.h] mentioned in `Message.rs`: https://github.com/lz4/lz4/blob/fdf2ef5809ca875c454510610764d9125ef2ebbd/lib/lz4frame.h
   - [lz4-rs lz4-sys APIs] https://github.com/10XGenomics/lz4-rs/blob/master/lz4-sys/src/lib.rs
   - [c++ uses LZ4F_xxx APIs] https://github.com/apache/arrow/blob/3694794bdfd0677b95b8c95681e392512f1c9237/cpp/src/arrow/util/compression_lz4.cc


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org