Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 12:50:57 UTC

[GitHub] [arrow-rs] alamb opened a new issue #208: flight_data_from_arrow_batch sends too much data

alamb opened a new issue #208:
URL: https://github.com/apache/arrow-rs/issues/208


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-12265
   
   Arrow arrays can share the same backing store, even if the array is just a "view" of a slice of another array.
   
   Yet, when `flight_data_from_arrow_batch` encodes the arrays into a FlightData, it blindly copies the entire buffer ready to be sent over the wire.
   
   Thus, for example, when DataFusion uses the `arrow::compute::limit` operator to return a few elements of an array, we still end up with the full (potentially large) array being sent over the wire.
   
    
   
   Since encoding the array in a FlightData involves copying the data anyway, perhaps it would be beneficial to take the Array length into consideration and copy only the parts of the buffer that contain actual data.
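
   For illustration, here is a minimal sketch (assuming the arrow 5.x `Array`/`ArrayData`/`Buffer` APIs referenced later in this thread) of what this looks like for a fixed-width array: the sliced array still points at the full backing buffer, while only a sub-range of its bytes is actually needed on the wire.
   
   ```rust
   use arrow::array::{Array, Int32Array};
   
   fn main() {
       let arr = Int32Array::from(vec![1, 2, 3, 4, 5]);
       let sliced = arr.slice(1, 3); // logical view of [2, 3, 4]
   
       let data = sliced.data();
       let buffer = &data.buffers()[0];
       // The backing buffer still holds all 5 values (20 bytes)...
       assert_eq!(buffer.len(), 5 * std::mem::size_of::<i32>());
   
       // ...but only this byte range needs to go over the wire.
       let byte_width = std::mem::size_of::<i32>();
       let start = data.offset() * byte_width;
       let end = start + data.len() * byte_width;
       let needed: &[u8] = &buffer.as_slice()[start..end];
       assert_eq!(needed.len(), 12); // 3 values * 4 bytes
   }
   ```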





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917692820


   Thanks much for the response @alamb 
   
   > I am not sure to be honest as I am not familiar with the flight code. Perhaps @nevi-me or @jorgecarleitao who have more experience in how IPC / flight is supposed to work might have thoughts on how to handle serializing bytes for an Array whose backing `Buffer` is much larger. Another avenue we can explore is to review how the C++ implementation handles the case and/or ask about this on [dev@arrow.apache.org](mailto:dev@arrow.apache.org).
   > 
   > One way to reduce potential unintended side effects could be to make the optimization optional (an option on [`IpcWriteOptions`](https://docs.rs/arrow/5.3.0/arrow/ipc/writer/struct.IpcWriteOptions.html), perhaps) while we test it out more broadly, and then switch the default value in a later version.
   >
   I checked the `IpcWriteOptions` for the C++ implementation (http://arrow.apache.org/docs/cpp/api/ipc.html) and I don't think they have an option for that. It also doesn't look like there is a user option for this in their Flight client implementation (http://arrow.apache.org/docs/cpp/api/flight.html#_CPPv4N5arrow6flight17FlightCallOptionsE). I can email the dev mailing list to see if they have any hidden logic for this that is not exposed to the end user.
   > `RecordBatch::slice` is what I know of for this purpose: https://docs.rs/arrow/5.3.0/arrow/record_batch/struct.RecordBatch.html#method.slice. (Kudos to @b41sh for adding that one)
   >
   Thanks, will check that out! Not sure how I missed that function.
   
   @nevi-me @jorgecarleitao any thoughts or preferences on how to handle this?





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-918286125


   @jorgecarleitao I believe this is done in `write_generic_binary` using 
   ```
   let first = *offsets.first().unwrap();
   let last = *offsets.last().unwrap();
   ```
   and then writing to buffer based on those values.
   
   Given the different approaches here between `arrow` and `arrow2`, is there a preference for how to handle this within `arrow`?
   
   I was thinking we could use the `RecordBatch::slice` method within `record_batch_to_bytes` before writing the data.
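
   To make the first/last-offset approach above concrete, here is a minimal, library-agnostic sketch (my own illustration, not the actual `write_generic_binary` code): given the offsets that cover the sliced rows, only `values[first..last]` needs to be written, and the offsets can be rebased to start at 0.
   
   ```rust
   // Trim the values buffer of a variable-length (string/binary) array to the
   // range covered by its offsets, and rebase the offsets to start at 0.
   fn trim_binary(offsets: &[i32], values: &[u8]) -> (Vec<i32>, Vec<u8>) {
       let first = *offsets.first().unwrap();
       let last = *offsets.last().unwrap();
       let rebased: Vec<i32> = offsets.iter().map(|o| o - first).collect();
       let trimmed = values[first as usize..last as usize].to_vec();
       (rebased, trimmed)
   }
   
   fn main() {
       // Offsets for ["hello", "you", "world"] sliced to the last two values.
       let offsets = [5_i32, 8, 13];
       let values = b"helloyouworld";
       let (rebased, trimmed) = trim_binary(&offsets, values);
       assert_eq!(rebased, vec![0, 3, 8]);
       assert_eq!(trimmed, b"youworld".to_vec());
   }
   ```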





[GitHub] [arrow-rs] jorgecarleitao commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917695353


   I am not sure about the root cause of this, but some time ago we merged something like this in arrow2. Could you check whether https://github.com/jorgecarleitao/arrow2/pull/194, specifically around `write_generic_binary` in `src/io/ipc/write/serialize.rs`, is what we are looking for here?
   
   Note that the code is significantly different at this point, but the gist is that we need to check if the array was sliced, and if it was, only write the relevant values.





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-916461549


   Thank you @alamb , very helpful. I will review and let you know if any questions. 





[GitHub] [arrow-rs] nevi-me commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-918505093


   @matthewmturner 
   
   > @jorgecarleitao I believe this is done in `write_generic_binary` using
   > 
   > ```
   > let first = *offsets.first().unwrap();
   > let last = *offsets.last().unwrap();
   > ```
   > 
   > and then writing to buffer based on those values.
   
   That will apply to strings, lists and binaries, but the overall problem is described below.
   
   We write `Buffer`s to IPC, and those buffers have a length and an offset (almost always 0). The problem is that when we write a buffer, we have to determine what its correct offset and length are, and the current APIs in the crate can't give us that information conveniently.
   
   For example, if I have a list of i64 values:
   
   ```rust
   List:
     offset_buffer: [0, 1, 3,  6, 10] // 5 offsets = 4 list values
     child_data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
     null_buffer: [T, F, T, F, T, F, T, T, T, F]
   ```
   
   There are 3 buffers to write.
   If the list gets sliced, `list.slice(2, 1)`, we now have a list that looks like:
   
   ```rust
   List:
     offset_buffer: [_, _, 3, 6, _]
     child_data: [_, _, _, 4, 5, 6, _, _, _, _]
     null_buffer: [_, _, _, F, T, F, _, _, _, _]
   ```
   
   In terms of the buffers, you have:
   
   ```rust
   buffer 1: type = i32, offset = 8 (2 * 4 bytes), len = 8 (2 * 4 bytes)
   buffer 2: type = i64, offset = 24 (3 * 8 bytes), len = 24 (3 * 8) // notice how the offset is 3 because of the list's first offset, and length is 3 because of (6 - 3) on the offsets (and the child data has 3 values)
   buffer 3: type = bool, offset = 0 (3 offsets don't cross a byte boundary), len = 1 byte 0b00000_010
   ```
   
   The root of the challenge above comes from the definition of `arrow::buffer::immutable::Buffer`
   
   ```rust
   /// Buffer represents a contiguous memory region that can be shared with other buffers and across
   /// thread boundaries.
   #[derive(Clone, PartialEq, Debug)]
   pub struct Buffer {
       /// the internal byte buffer.
       data: Arc<Bytes>,
   
       /// The offset into the buffer.
       offset: usize,
   }
   ```
   
   and the current state is that the only method that sets the `offset` above is
   
   ```rust
       pub fn slice(&self, offset: usize) -> Self {
           assert!(
               offset <= self.len(),
               "the offset of the new Buffer cannot exceed the existing length"
           );
           Self {
               data: self.data.clone(),
               offset: self.offset + offset,
           }
       }
   ```
   
   One of the foundations of `arrow2` is that a `Buffer` knows its offset and length based on its content. If a string buffer is created with "hello", "you", "world", a slice of 2 means that the buffer will know to offset 8 bytes, making the IPC process easy (@jorgecarleitao this is my understanding without having checked the code as I write this).
   
   ___
   
   So, to only write the correct amount of data in IPC, my approach would be to modify `write_array_data()` in `arrow::ipc::writer` to account for the offset and correct length, and probably change `write_buffer()` in the same module to take the sliced bytes instead of a `Buffer`.
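
   To make the "account for the offset and correct length" step concrete, here is a minimal sketch (my own illustration, not the proposed patch) of the byte-range computation for the two kinds of buffers in the example above: fixed-width value buffers and bit-packed buffers such as the null bitmap.
   
   ```rust
   // Compute the byte range of a buffer that actually needs to be written,
   // given the array's logical offset and length. `byte_width` is None for
   // bit-packed buffers (validity bitmaps, boolean values).
   fn buffer_byte_range(offset: usize, len: usize, byte_width: Option<usize>) -> (usize, usize) {
       match byte_width {
           // Fixed-width values: a plain byte range.
           Some(w) => (offset * w, (offset + len) * w),
           // Bit-packed values: round down/up to byte boundaries; any leftover
           // bit offset still has to be handled (e.g. by repacking the bits).
           None => (offset / 8, (offset + len + 7) / 8),
       }
   }
   
   fn main() {
       // i64 child values sliced to offset 3, len 3 => bytes 24..48, as in the
       // list example above.
       assert_eq!(buffer_byte_range(3, 3, Some(8)), (24, 48));
       // Null bitmap sliced to offset 3, len 3 => bits 3..6, all in byte 0.
       assert_eq!(buffer_byte_range(3, 3, None), (0, 1));
   }
   ```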





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917698616


   @jorgecarleitao sure will check it out 





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-916215681


   @alamb I'm interested in looking into this. I see it's been open for a while - can you confirm this is still an issue and whether there is any additional context that has come up since it was initially raised? Thanks!





[GitHub] [arrow-rs] alamb edited a comment on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917615261


   > And the issue is that the data buffer points to the original larger array. Then, that larger array is ultimately turned into the FlightData which is a waste.
   
   Yes, that is the crux of the issue we found. 
   
   > Assuming that's all correct is there a preference as to where a fix should be applied? i.e. whether at flight_data_from_arrow_batch, encoded_batch, or record_batch_to_bytes?
   
   I am not sure to be honest as I am not familiar with the flight code. Perhaps @nevi-me  or @jorgecarleitao who have more experience in how IPC / flight is supposed to work might have thoughts on how to handle serializing bytes for an Array whose backing `Buffer` is much larger. Another avenue we can explore is to review how the C++ implementation handles the case and/or ask about this on dev@arrow.apache.org.
   
    One way to reduce potential unintended side effects could be to make the optimization optional (an option on [`IpcWriteOptions`](https://docs.rs/arrow/5.3.0/arrow/ipc/writer/struct.IpcWriteOptions.html), perhaps) while we test it out more broadly, and then switch the default value in a later version. 
   
   > Naively I was thinking at the record_batch_to_bytes level - but i think that might impact IPC in general.
   
   Yes. However, maybe that is ok (as that seems to be optimizing the serialization of Arrow Arrays); I am not sure what the expectations are here.
   
   > Separately, ive been looking if there are any methods / helpers for recreating a RecordBatch out of the data / offsets / len of another RecordBatch.
   
   `RecordBatch::slice` is what I know of for this purpose: https://docs.rs/arrow/5.3.0/arrow/record_batch/struct.RecordBatch.html#method.slice. (Kudos to @b41sh for adding that one)
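
   For anyone following along, a quick usage sketch of `RecordBatch::slice` (assuming the arrow 5.x API linked above): the result is a zero-copy view whose columns still share the original buffers, which is exactly why the IPC/Flight writer has to account for offsets when serializing it.
   
   ```rust
   use std::sync::Arc;
   use arrow::array::{Array, Int32Array};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   
   fn main() {
       let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
       let batch = RecordBatch::try_new(
           schema,
           vec![Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5]))],
       )
       .unwrap();
   
       let sliced = batch.slice(1, 3); // rows [2, 3, 4]
       assert_eq!(sliced.num_rows(), 3);
       // The sliced column is still a view into the original backing buffer.
       assert_eq!(sliced.column(0).data().offset(), 1);
   }
   ```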





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917069278


   @alamb I think I'm restating the obvious and what has already been said, but I want to make sure I understand what's happening, so I made a small sample.
   
   ```rust
   use arrow::array::{Array, Int32Array};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use std::sync::Arc;
   
   pub fn test_record_batch_size() {
       let arr_data = vec![1, 2, 3, 4, 5];
       let val_data = vec![5, 6, 7, 8, 9];
       let id_arr = Int32Array::from(arr_data);
       let val_arr = Int32Array::from(val_data);
       let id_arr_slice = id_arr.slice(1, 3);
       let val_arr_slice = val_arr.slice(1, 3);
   
       let schema = Schema::new(vec![
           Field::new("id", DataType::Int32, false),
           Field::new("val", DataType::Int32, false),
       ]);
   
       let batch = RecordBatch::try_new(Arc::new(schema), vec![id_arr_slice, val_arr_slice]).unwrap();
       println!("{:?}", batch);
   
       for column in batch.columns() {
           println!("{:?}", column.data());
       }
   }
   ```
   
   Produces the following output
   ```
   RecordBatch { schema: Schema { fields: [Field { name: "id", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "val", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     2,
     3,
     4,
   ], PrimitiveArray<Int32>
   [
     6,
     7,
     8,
   ]] }
   ArrayData { data_type: Int32, len: 3, null_count: 0, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x149e06c40, len: 20, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None }
   1
   ArrayData { data_type: Int32, len: 3, null_count: 0, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x149e06d00, len: 20, data: [5, 0, 0, 0, 6, 0, 0, 0, 7, 0, 0, 0, 8, 0, 0, 0, 9, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None }
   ```
   
   And the issue is that the data buffer points to the original larger array. Then, that larger array is ultimately turned into the `FlightData`, which is a waste.
   
   Assuming that's all correct, is there a preference as to where a fix should be applied? I.e. at `flight_data_from_arrow_batch`, `encoded_batch`, or `record_batch_to_bytes`?
   
   Naively I was thinking at the `record_batch_to_bytes` level - but I think that might impact IPC in general. I'm still figuring out the separation between IPC and Flight functionality, though, and whether this issue is focused only on updating how array data is handled for Flight or for IPC in general. If we wanted it to be closer to the Flight level, then I think copying the `RecordBatch` in `flight_data_from_arrow_batch` before passing it to `encoded_batch` would be the way.
   
   What do you think?
   
   Separately, I've been looking into whether there are any methods / helpers for recreating a `RecordBatch` out of the data / offsets / len of another `RecordBatch`. I don't think I've found anything though. If that's the case, would the idea be to just remake the batch from scratch with the data from the original?
   
   Hope that's all clear.
   
   





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-919243942


   @nevi-me thank you for the detailed explanation. Very helpful. I'm working through some examples on my side to solidify my understanding and will come back if I have any questions; otherwise I will just open a PR for a fix.
   
   Thanks again!





[GitHub] [arrow-rs] nevi-me commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917699841


   I was going to suggest the same thing, that we only materialise and write the length that excludes the offsets.
   
   It's something fundamental that I missed when working on the IPC support. I think that when writing data in IPC we should reset the offsets to 0 and only write the relevant data.
   
   On Sun, 12 Sep 2021, 21:46 Matthew Turner, ***@***.***> wrote:
   
   > @jorgecarleitao <https://github.com/jorgecarleitao> sure will check it out
   





[GitHub] [arrow-rs] alamb commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-916422638


   Hi @matthewmturner  -- thanks!
   
   Some additional context is that we encountered this issue while working on IOx -- see details at https://github.com/influxdata/influxdb_iox/issues/1133
   
   @mkmik worked around the issue in IOx via a heuristic in https://github.com/influxdata/influxdb_iox/commit/82ed5d1708458ec9438406ce1fd19aa0c7d23204
   
   Namely, there is code that deep clones the array prior to serialization if some guesstimate of buffer size is hit.
   
   Fixing the underlying problem (and only serializing the part of the data that is needed) would be great.
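
   As a rough sketch of that workaround idea (my own illustration, not the IOx code linked above, and assuming arrow's `MutableArrayData` API), the deep clone rebuilds each column into a freshly allocated array so a sliced view no longer drags the full backing buffer along:
   
   ```rust
   use arrow::array::{make_array, Array, ArrayRef, MutableArrayData};
   use arrow::record_batch::RecordBatch;
   
   // Copy only the logical rows of each (possibly sliced) column into new,
   // tightly-sized buffers before handing the batch to the Flight/IPC writer.
   fn deep_clone_batch(batch: &RecordBatch) -> RecordBatch {
       let columns: Vec<ArrayRef> = batch
           .columns()
           .iter()
           .map(|col| {
               let data = col.data();
               let mut mutable = MutableArrayData::new(vec![data], false, col.len());
               mutable.extend(0, 0, col.len());
               make_array(mutable.freeze())
           })
           .collect();
       RecordBatch::try_new(batch.schema(), columns).unwrap()
   }
   ```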





[GitHub] [arrow-rs] matthewmturner commented on issue #208: flight_data_from_arrow_batch sends too much data

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-919341814


   Hi @nevi-me - I've been building up a test that I could use to compare the IPC size before and after the fix, but I haven't been able to produce the expected results. Code is below:
   
   ```rust
   use arrow::array::{Array, Int32Array};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::ipc::reader::FileReader;
   use arrow::ipc::writer::FileWriter;
   use arrow::record_batch::RecordBatch;
   use std::fs::File;
   use std::sync::Arc;
   
   pub fn compare_ipc() {
       let arr_data = vec![1, 2, 3, 4, 5];
       let val_data = vec![5, 6, 7, 8, 9];
       let id_arr = Int32Array::from(arr_data);
       let val_arr = Int32Array::from(val_data);
       let id_arr_slice = id_arr.slice(1, 3);
       let val_arr_slice = val_arr.slice(1, 3);
   
       let schema = Schema::new(vec![
           Field::new("id", DataType::Int32, false),
           Field::new("val", DataType::Int32, false),
       ]);
   
       let raw_batch = RecordBatch::try_new(
           Arc::new(schema.clone()),
           vec![Arc::new(id_arr), Arc::new(val_arr)],
       )
       .unwrap();
       println!("{:?}", raw_batch);
   
       let slice_batch =
           RecordBatch::try_new(Arc::new(schema.clone()), vec![id_arr_slice, val_arr_slice]).unwrap();
       println!("{:?}", slice_batch);
   
       println!("Running first test");
       raw_batch
           .columns()
           .iter()
           .zip(slice_batch.columns())
           .for_each(|(a, b)| {
               println!("{:?} : {:?}", a.data(), b.data());
               assert_eq!(a.data_type(), b.data_type());
               assert_eq!(a.data().buffers()[0], b.data().buffers()[0]);
           });
   
       let raw_path = "raw_data.arrow";
       let slice_path = "slice_data.arrow";
   
       {
           let raw_file = File::create(raw_path).unwrap();
           let mut raw_writer = FileWriter::try_new(raw_file, &schema).unwrap();
   
           raw_writer.write(&raw_batch).unwrap();
           raw_writer.finish().unwrap();
       }
       {
           let slice_file = File::create(slice_path).unwrap();
           let mut slice_writer = FileWriter::try_new(slice_file, &schema).unwrap();
   
           slice_writer.write(&slice_batch).unwrap();
           slice_writer.finish().unwrap();
       }
   
       let raw_file = File::open(raw_path).unwrap();
       let slice_file = File::open(slice_path).unwrap();
       let mut raw_reader = FileReader::try_new(raw_file).unwrap();
       let mut slice_reader = FileReader::try_new(slice_file).unwrap();
   
       while let Some(Ok(raw_ipc_batch)) = raw_reader.next() {
           println!("{:?}", raw_ipc_batch);
           while let Some(Ok(slice_ipc_batch)) = slice_reader.next() {
               println!("{:?}", slice_ipc_batch);
               raw_ipc_batch
                   .columns()
                   .iter()
                   .zip(slice_ipc_batch.columns())
                   .for_each(|(a, b)| {
                       println!("{:?} : {:?}", a.data(), b.data());
                       assert_eq!(a.data_type(), b.data_type());
                       assert_eq!(a.data().buffers()[0], b.data().buffers()[0]);
                   });
           }
       }
   }
   ```
   Which produces the following output:
   ```
   RecordBatch { schema: Schema { fields: [Field { name: "id", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "val", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     1,
     2,
     3,
     4,
     5,
   ], PrimitiveArray<Int32>
   [
     5,
     6,
     7,
     8,
     9,
   ]] }
   RecordBatch { schema: Schema { fields: [Field { name: "id", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "val", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     2,
     3,
     4,
   ], PrimitiveArray<Int32>
   [
     6,
     7,
     8,
   ]] }
   Running first test
   ArrayData { data_type: Int32, len: 5, null_count: 0, offset: 0, buffers: [Buffer { data: Bytes { ptr: 0x11d606c40, len: 20, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None } : ArrayData { data_type: Int32, len: 3, null_count: 0, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x11d606c40, len: 20, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None }
   ArrayData { data_type: Int32, len: 5, null_count: 0, offset: 0, buffers: [Buffer { data: Bytes { ptr: 0x11d606d00, len: 20, data: [5, 0, 0, 0, 6, 0, 0, 0, 7, 0, 0, 0, 8, 0, 0, 0, 9, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None } : ArrayData { data_type: Int32, len: 3, null_count: 0, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x11d606d00, len: 20, data: [5, 0, 0, 0, 6, 0, 0, 0, 7, 0, 0, 0, 8, 0, 0, 0, 9, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None }
   RecordBatch { schema: Schema { fields: [Field { name: "id", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "val", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     1,
     2,
     3,
     4,
     5,
   ], PrimitiveArray<Int32>
   [
     5,
     6,
     7,
     8,
     9,
   ]] }
   RecordBatch { schema: Schema { fields: [Field { name: "id", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "val", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
   [
     null,
     null,
     5,
   ], PrimitiveArray<Int32>
   [
     null,
     null,
     9,
   ]] }
   ArrayData { data_type: Int32, len: 5, null_count: 0, offset: 0, buffers: [Buffer { data: Bytes { ptr: 0x11d607ac0, len: 24, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None } : ArrayData { data_type: Int32, len: 3, null_count: 2, offset: 0, buffers: [Buffer { data: Bytes { ptr: 0x11d704240, len: 12, data: [0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x11d7041c0, len: 1, data: [4] }, offset: 0 } }) }
   thread 'main' panicked at 'assertion failed: `(left == right)`
     left: `Buffer { data: Bytes { ptr: 0x11d607ac0, len: 24, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0] }, offset: 0 }`,
    right: `Buffer { data: Bytes { ptr: 0x11d704240, len: 12, data: [0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }`', src/flight_sends_too_much_data.rs:149:21
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   Basically, I'm just trying to compare the buffers from two batches (one batch is a slice of the other) after reading their IPC files and comparing the value buffers. Given what we are working on, I was expecting the data to be the same (I guess the assertion would still fail after reading the IPC files since they would have different pointers, but I expected the value arrays to have the same values). But the value arrays were different (the full `ArrayData` values are above):
   
   ```
   left: `Buffer { data: Bytes { ptr: 0x11d607ac0, len: 24, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0] }, offset: 0 }`,
   right: `Buffer { data: Bytes { ptr: 0x11d704240, len: 12, data: [0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }`', src/flight_sends_too_much_data.rs:149:21
   ```
   I'm going to keep playing around with this but wanted to get your thoughts on whether I am approaching this the right way.
   
   Thanks again for all your help - much appreciated.
   

