You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/28 21:49:15 UTC

[GitHub] [arrow-rs] aarashy opened a new issue, #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes

aarashy opened a new issue, #3215:
URL: https://github.com/apache/arrow-rs/issues/3215

   **Describe the bug**
   <!--
   A clear and concise description of what the bug is.
   -->
   When reading Arrow Bytes, it seems boolean array inputs can trigger panics under certain conditions.
   If my input bytes are bad, I want the arrow API to throw an `Err`, not panic. 
   
   In this case, I don't think there's something wrong with my input bytes - rather, it seems like there's an off-by-one error internally within `arrow-rs` which is causing invariants to be broken. 
   
   The error is being thrown in the `validate` routine:
   https://github.com/apache/arrow-rs/blob/master/arrow-data/src/data.rs#L667-L673
   
   And it's being unwrapped in the caller in `create_primitive_array`, which DOES NOT return a `Result`.
   https://github.com/apache/arrow-rs/blob/master/arrow-ipc/src/reader.rs#L487
   
   Judging by the fact that `unwrap` is used, this function presupposes certain invariants that make it safe to use `unwrap`, but for some inputs, these invariants seem to not be met.
   
   Since the problem appears to be with the length of a `buffer`, I'm tracing the origin of that buffer (per my stack trace) to https://github.com/apache/arrow-rs/blob/master/arrow-ipc/src/reader.rs#L1140 - I might be wrong.
   
   ```
   thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Need at least 321 bytes for bitmap in buffers[0] in array of type Boolean, but got 320")', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/arrow-23.0.0/src/ipc/reader.rs:486:14
   stack backtrace:
      0: rust_begin_unwind
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
      1: core::panicking::panic_fmt
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
      2: core::result::unwrap_failed
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/result.rs:1785:5
      3: arrow::ipc::reader::create_primitive_array
      4: arrow::ipc::reader::create_array
      5: arrow::ipc::reader::read_record_batch
      6: arrow::ipc::reader::StreamReader<R>::maybe_next
      7: arrow::ipc::reader::StreamReader<R>::maybe_next
      8: arrow::ipc::reader::StreamReader<R>::maybe_next
      9: arrow::ipc::reader::StreamReader<R>::maybe_next
     10: arrow::ipc::reader::StreamReader<R>::maybe_next
     11: arrow::ipc::reader::StreamReader<R>::maybe_next
     12: arrow::ipc::reader::StreamReader<R>::maybe_next
     13: arrow::ipc::reader::StreamReader<R>::maybe_next
     14: arrow::ipc::reader::StreamReader<R>::maybe_next
     15: arrow::ipc::reader::StreamReader<R>::maybe_next
     16: arrow::ipc::reader::StreamReader<R>::maybe_next
     17: arrow::ipc::reader::StreamReader<R>::maybe_next
     18: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
     19: core::iter::adapters::try_process
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
   ```
   
   **To Reproduce**
   <!--
   Steps to reproduce the behavior:
   -->
   I am using the following routine to read IPC bytes. I don't currently have an input `bytes` which triggers this, but I can find one if you think it's critical for you to debug this properly. 
   ```
   pub fn from_ipc_bytes(bytes: &[u8]) -> Result<Vec<RecordBatch>, anyhow::Error> {
       let cursor: Cursor<&[u8]> = Cursor::new(bytes);
       let reader = arrow::ipc::reader::StreamReader::try_new(cursor, None)?;
       let record_batches = reader.collect::<Result<Vec<RecordBatch>, arrow::error::ArrowError>>()?;
       Ok(record_batches)
   }
   ```
   
   
   
   **Expected behavior**
   <!--
   A clear and concise description of what you expected to happen.
   -->
   Either throw an error if there's a problem with my input bytes, or succeed if there is no problem with my input bytes, but do not panic.
   
   
   **Additional context**
   <!--
   Add any other context about the problem here.
   -->
   @alamb - You seem to have written these validation checks, so I wonder if you would understand what might be happening for me. Let me know if it's crucial for me to provide bytes inputs which trigger the panic, but I think the stacktrace and error message here might be enough to go off of.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1329956902

   > Do you think it would be crazy for create_primitive_array to return a Result so that I can question-mark the build result and remove the unwraps in that function entirely? 
   
   I don't think that would be crazy at all -- sounds like a good idea to me
   
   > However, there's a bigger question of why the data has only 320 instead of 321 in the first place.
   
   👍 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1332571043

   > Now my last question is, are we okay with the fact that the UnalignedBitChunk indexes into the slice without checking the bounds first, causing a panic?
   
   I think the issue here is that currently `create_dictionary_array` isn't performing validation. I think this is incorrect and should be fixed.
   
   `UnalignedBitChunk` expects that its inputs are already bounds checked, and panics as an invariant has been violated. This I think is fine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] aarashy commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
aarashy commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1331649397

   I removed the unwraps here https://github.com/apache/arrow-rs/pull/3232
   
   I have some bytes which reproduce this error, but the data is private. The bytes were the result of the `apache-arrow` NPM package function `tableToIPC` - https://github.com/apache/arrow/blob/5611f2bd0d6136b005d137a84b50709fc5c813bb/js/src/ipc/serialization.ts#L61
   
   # More info about the error stacktrace above
   Let me share a little bit more about the error in `ArrayData::validate` above. 
   1. The value of `len_plus_offset` is 2565, and `offset` is 0. Thus, 2565 is the number of rows in the the record batch from higher in the callstack (`read_record_batch`). 
   2. 2565 is not divisible by 8 (2565/8 = 320.625) - so it seems correct for the `validate` function to be expecting 321 bytes if we need to fit 2565 bits within byte-boundaries (there is a ceiling function being applied on the division, which seems correct). 
   
   Thus, I really want to understand why the buffer is given only 320 bytes. It seems like instead of correctly zero padding the 2565 bits to reach the next byte-boundary, the bits have truncated at the previous byte-boundary and the byte buffer is 1 byte too small.
   
   I'm struggling to really find the origin of the length of the Array data byte buffers here which would be performing this truncation... I suppose it would be `Message::header_as_dictionary_batch` -> `Message::buffers`?
   
   # Another similar error stacktrace
   I have another callstack which is throwing a similar panic on a different input bytes set. This one is from indexing into a slice out of bounds, again off-by-one.
   
   ```
   thread 'tokio-runtime-worker' panicked at 'range end index 65 out of range for slice of length 64', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/arrow-buffer-28.0.0/src/util/bit_chunk_iterator.rs:57:23
   stack backtrace:
      0: rust_begin_unwind
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
      1: core::panicking::panic_fmt
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
      2: core::slice::index::slice_end_index_len_fail_rt
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/slice/index.rs:76:5
      3: core::slice::index::slice_end_index_len_fail
                at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/slice/index.rs:69:9
      4: arrow_buffer::util::bit_chunk_iterator::UnalignedBitChunk::new
      5: arrow_buffer::buffer::immutable::Buffer::count_set_bits_offset
      6: arrow_data::data::ArrayData::new_unchecked
      7: arrow_data::data::ArrayDataBuilder::build_unchecked
      8: arrow_ipc::reader::create_dictionary_array
      9: arrow_ipc::reader::create_array
     10: arrow_ipc::reader::read_record_batch
   ```
   In this case: 
   1. The record batch is of length `518`, which is again not divisible by 8 (64.75). 
   2. The panic is coming from https://github.com/apache/arrow-rs/blob/master/arrow-buffer/src/util/bit_chunk_iterator.rs#L57 - the `UnalignedBitChunk` is being created downstream of the `count_nulls` routine in `ArrayData::new_unchecked`. This callstack doesn't perform validations or return `Result`, so it's awkward for me to add a slice-out-bounds check and return an error. To me this indicates, again, that the wherever the `null_bit_buffer` length is being initialized, it's being truncated incorrectly.
   3. The `data_type` of the array being created here is `Dictionary(Int32, Utf8)`.
   
   
   I wonder if @tustvold, @alamb, or @HaoYang670 would know anything about improperly truncated / padded byte buffers during IPC byte reading. It's possible that my input is missing some padding, but perhaps this crate should generously perform padding which the Javascript apache-arrow crate is neglecting. It's also possible that this is a Javascript arrow package bug, but if we have an opportunity to be more permissive and flexible here, we should take it (and we at least shouldn't have panics),


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1344634636

   `label_issue.py` automatically added labels {'arrow'} from #3247


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)
URL: https://github.com/apache/arrow-rs/issues/3215


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1339105225

   I don't think there are further action items for this ticket, so closing. Feel free to reopen if I have missed something


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1329830491

   Hi @aarashy  -- thank you for the report. 
   
   The validation routine you point to is basically saying that it expects that the size of the underlying buffer for the BooleanArray is at least 321 bytes, but for some reason the data has only 321. 
   
   I agree that the IPC reader should return the error rather than `panic`ing in this case.
   
   Not sure if you are up for proposing a PR to fix the issue (propagate the error) but if you are I would be most appreciative


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] aarashy commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
aarashy commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1329858594

   @alamb - I appreciate the prompt response. 
   
   Do you think it would be crazy for `create_primitive_array` to return a `Result` so that I can question-mark the `build` result and remove the `unwrap`s in that function entirely? It seems the only caller of `create_primitive_array` is `create_array` which returns a `Result`, so I think there wouldn't be too much refactoring there.
   
   However, there's a bigger question of why the data has only 320 instead of 321 in the first place. 
   That, I am less sure about. 
   I'll try to get a `bytes` input that triggers this panic to enable a full repro and maybe it would be more obvious then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] aarashy commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
aarashy commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1332567160

   @viirya , thanks for your helpful reply. I've created this issue in the arrow repo to track it there. (https://github.com/apache/arrow/issues/14791)
   
   Now my last question is, are we okay with the fact that the `UnalignedBitChunk` indexes into the slice without checking the bounds first, causing a panic? 
   
   It seems like, in situations where the input arrow bytes are the correct shape but have some subtle bug like this, `arrow-rs` is vulnerable to panic. Do you think it's important to add appropriate checks here instead of using functions like `new_unchecked`? By its name, this function seems to believe that essential invariants like buffer size have already been verified, but in this case they are not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on issue #3215: Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error)

Posted by GitBox <gi...@apache.org>.
viirya commented on issue #3215:
URL: https://github.com/apache/arrow-rs/issues/3215#issuecomment-1331769380

   If your IPC payload is generated by apache-arrow NPM package function `tableToIPC`, the size of buffers is produced by that, not from arrow-rs. IPC reader just reads provided buffer with provided offset/length.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org