Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/08 08:43:14 UTC

[GitHub] [arrow-rs] REASY opened a new issue, #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

REASY opened a new issue, #1528:
URL: https://github.com/apache/arrow-rs/issues/1528

   When you slice a `RecordBatch` and serialize it with `StreamWriter`, it produces an incorrect result. I'm using `arrow = "11.1.0"`.
   
   To reproduce, one can use the following test:
   ```rust
   #[cfg(test)]
   mod tests {
       use std::sync::Arc;
       use arrow::array::{Int32Array, StringArray};
       use arrow::datatypes::{DataType, Field, Schema};
       use arrow::ipc::writer::StreamWriter;
       use arrow::record_batch::RecordBatch;
   
       #[test]
       fn it_works() {
           // Serialize a single batch to an in-memory Arrow IPC stream.
           fn serialize(record: &RecordBatch) -> Vec<u8> {
               let buffer: Vec<u8> = Vec::new();
               let mut stream_writer = StreamWriter::try_new(buffer, &record.schema()).unwrap();
               stream_writer.write(record).unwrap();
               stream_writer.finish().unwrap();
               stream_writer.into_inner().unwrap()
           }
   
           fn create_batch(rows: usize) -> RecordBatch {
               let schema = Schema::new(vec![
                   Field::new("a", DataType::Int32, false),
                   Field::new("b", DataType::Utf8, false),
               ]);
   
               let a = Int32Array::from(vec![1; rows]);
               let b = StringArray::from(vec!["a"; rows]);
   
               RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)])
                   .unwrap()
           }
           let big_record_batch = create_batch(65536);
           println!("big_record_batch with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes", big_record_batch.num_rows(),
                    big_record_batch.num_columns(), serialize(&big_record_batch).len());
           let length = 5;
           let small_record_batch = create_batch(length);
           println!("small_record_batch with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes", small_record_batch.num_rows(),
                    small_record_batch.num_columns(), serialize(&small_record_batch).len());
   
           let offset = 2;
           let record_batch_slice = big_record_batch.slice(offset, length);
           println!("(Sliced): record_batch_slice with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes", record_batch_slice.num_rows(),
                    record_batch_slice.num_columns(), serialize(&record_batch_slice).len());
       }
   }
   ```
   As you can see, the sliced batch serializes to almost the same size as `big_record_batch`, but I would expect it to be the same size as `small_record_batch`:
   ```
   big_record_batch with dimension (65536, 2) (rows x cols) serialized as Apache Arrow IPC in 606608 bytes
   small_record_batch with dimension (5, 2) (rows x cols) serialized as Apache Arrow IPC in 464 bytes
   (Sliced): record_batch_slice with dimension (5, 2) (rows x cols) serialized as Apache Arrow IPC in 590240 bytes
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1528:
URL: https://github.com/apache/arrow-rs/issues/1528#issuecomment-1182366890

   see https://github.com/apache/arrow-rs/pull/2040 from @viirya  ❤️ 




[GitHub] [arrow-rs] tustvold commented on issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1528:
URL: https://github.com/apache/arrow-rs/issues/1528#issuecomment-1179567360

   As stated above, this isn't a bug per se; rather, the IPC format faithfully sends the full representation of the arrays over the wire, even if some portion of the values has been logically sliced away. I think a feature that truncates buffers, rewrites offsets, etc. is definitely possible, as described in #208.
   
   I personally have very limited time to spend on this, but perhaps @nevi-me or @viirya might have some spare cycles?
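   
   To make the "truncate buffers, rewrite offsets" idea concrete, here is a minimal sketch in plain Rust. It does not use the `arrow` crate: `Utf8Column`, `slice`, `compact`, and `buffer_bytes` are hypothetical names for illustration only, and the layout (an `i32` offsets buffer plus a values buffer) is assumed to mirror a `Utf8` array. It is not the implementation proposed in #208.
   
   ```rust
   /// Hypothetical stand-in for a Utf8 array (not an arrow-rs type):
   /// an offsets buffer, a values buffer, and a logical window over them.
   #[derive(Debug, Clone, PartialEq)]
   struct Utf8Column {
       offsets: Vec<i32>, // rows + 1 entries for an unsliced column
       values: Vec<u8>,   // concatenated string bytes
       start: usize,      // first logical row, as an index into `offsets`
       len: usize,        // number of logical rows
   }
   
   impl Utf8Column {
       fn from_strs(items: &[&str]) -> Self {
           let mut offsets = vec![0i32];
           let mut values = Vec::new();
           for s in items {
               values.extend_from_slice(s.as_bytes());
               offsets.push(values.len() as i32);
           }
           Utf8Column { offsets, values, start: 0, len: items.len() }
       }
   
       /// Logical slice: only `start` and `len` change. The clones here stand
       /// in for sharing reference-counted buffers; the buffer contents are
       /// exactly those of the parent.
       fn slice(&self, offset: usize, length: usize) -> Self {
           assert!(offset + length <= self.len, "slice out of bounds");
           Utf8Column {
               offsets: self.offsets.clone(),
               values: self.values.clone(),
               start: self.start + offset,
               len: length,
           }
       }
   
       /// Truncate both buffers to the logical window and rebase the offsets
       /// so that they start at zero again.
       fn compact(&self) -> Self {
           let window = &self.offsets[self.start..=self.start + self.len];
           let base = window[0];
           let end = window[self.len];
           Utf8Column {
               offsets: window.iter().map(|o| o - base).collect(),
               values: self.values[base as usize..end as usize].to_vec(),
               start: 0,
               len: self.len,
           }
       }
   
       /// Bytes a writer would emit if it copied both buffers verbatim.
       fn buffer_bytes(&self) -> usize {
           self.offsets.len() * std::mem::size_of::<i32>() + self.values.len()
       }
   
       /// Read logical row `i`.
       fn get(&self, i: usize) -> &str {
           let lo = self.offsets[self.start + i] as usize;
           let hi = self.offsets[self.start + i + 1] as usize;
           std::str::from_utf8(&self.values[lo..hi]).unwrap()
       }
   }
   ```
   
   In this sketch, `slice(2, 5)` on a 65536-row column still reports the parent's full `buffer_bytes()`, while `compact()` on that slice leaves six offsets and five value bytes and returns the same strings from `get`.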




[GitHub] [arrow-rs] viirya commented on issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

Posted by GitBox <gi...@apache.org>.
viirya commented on issue #1528:
URL: https://github.com/apache/arrow-rs/issues/1528#issuecomment-1179569742

   I will try to take a look this weekend.




[GitHub] [arrow-rs] kaaniboy commented on issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

Posted by GitBox <gi...@apache.org>.
kaaniboy commented on issue #1528:
URL: https://github.com/apache/arrow-rs/issues/1528#issuecomment-1179400358

   Is there any plan to resolve this issue? For my use case, I care specifically that I can write multiple smaller IPC messages rather than a single large one. I hoped to achieve this by slicing the large `RecordBatch` and writing each slice separately. It seems like [a similar issue](https://github.com/jorgecarleitao/arrow2/issues/192) existed in `arrow2` but was resolved last year.
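   
   For reference, the slicing approach described above boils down to computing `(offset, length)` pairs and passing each to `RecordBatch::slice`. A minimal sketch in plain Rust follows; `chunk_ranges` and `max_rows` are hypothetical names, not part of the `arrow` crate. Note that, per this issue, each slice written on the affected version would still carry the parent's full buffers.
   
   ```rust
   /// Hypothetical helper: split `num_rows` rows into consecutive
   /// (offset, length) ranges of at most `max_rows` rows each.
   fn chunk_ranges(num_rows: usize, max_rows: usize) -> Vec<(usize, usize)> {
       assert!(max_rows > 0, "max_rows must be positive");
       (0..num_rows)
           .step_by(max_rows)
           // The last range may be shorter than `max_rows`.
           .map(|offset| (offset, max_rows.min(num_rows - offset)))
           .collect()
   }
   ```
   
   Each pair would then be used as `batch.slice(offset, length)` before handing the slice to the writer.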




[GitHub] [arrow-rs] tustvold commented on issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1528:
URL: https://github.com/apache/arrow-rs/issues/1528#issuecomment-1100708700

   Can you confirm that the issue is just the size of the written file, and not a correctness problem - i.e. the data is larger than it could be, but still round-trips correctly? If so, I think as you've suggested this might be a duplicate of https://github.com/apache/arrow-rs/issues/208.




[GitHub] [arrow-rs] tustvold closed issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1528: RecordBatch: Serialization of sliced record using StreamWriter produces incorrect resut
URL: https://github.com/apache/arrow-rs/issues/1528

