You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/16 13:52:05 UTC

[GitHub] [arrow-rs] msalib opened a new issue, #3123: arithmatic overflow leads to segfault in `concat_batches`

msalib opened a new issue, #3123:
URL: https://github.com/apache/arrow-rs/issues/3123

   **Describe the bug**
   
   I can reliably reproduce a bug where `concat_batches` performs an addition that overflows (see https://github.com/apache/arrow-rs/blob/master/arrow-data/src/transform/utils.rs#L39). In release mode, the result for my application and data is a segfault on linux. When I run the same application in debug mode, I get a panic since checked arithmetic panics in debug mode.
   
   **To Reproduce**
   
   I've tried to minimize this as much as I can. Checkout the repo https://github.com/msalib/broken-arrow and run the following command:
   ```sh
   RUST_BACKTRACE=full cargo run --release -- redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet 
 redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.pa
 rquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redac
 ted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet redacted.parquet
   ```
   
   You should see a panic with backtrace (I'm running on x86-64 linux). That repo includes a Cargo.toml which enables overflow-checks in release builds because I got tired of waiting for slower debug builds. If you run the same command with just one less copy of redacted.parquet, it will succeed without panicing.
   
   **Expected behavior**
   
   The code should not segfault. Either the arithmetic should be checked for overflow so we get a panic or (much more preferable) the overflow should be avoided.
   
   **Additional context**
   
   Here's the full backtrace from my application (code and data not included here because it is proprietary, but we're using color-eyre and tracing so we get more detailed backtrace):
   ```
     10: core::panicking::panic::hb3ad04c589a0e3c8
         at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:48
     11: <i32 as core::ops::arith::Add>::add::h9c2a7b4ebe56cff6
         at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/arith.rs:126
     12: arrow_data::transform::utils::extend_offsets::{{closure}}::hd8682bb256e4642b
         at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/utils.rs:39
           37 │         // compute the new offset
           38 │         let length = offsets[1] - offsets[0];
           39 >         last_offset = last_offset + length;
           40 │         buffer.push(last_offset);
           41 │     });
     13: core::iter::traits::iterator::Iterator::for_each::call::{{closure}}::h0ad46e165b780cd0
         at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:828
     14: core::iter::traits::iterator::Iterator::fold::hf19d319b4c63662a
         at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:2414
     15: core::iter::traits::iterator::Iterator::for_each::hdc528a3ba10831bd
         at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/iter/traits/iterator.rs:831
     16: arrow_data::transform::utils::extend_offsets::h2e5dcad89a1d4f4a
         at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/utils.rs:36
           34 │ ) {
           35 │     buffer.reserve(offsets.len() * std::mem::size_of::<T>());
           36 >     offsets.windows(2).for_each(|offsets| {
           37 │         // compute the new offset
           38 │         let length = offsets[1] - offsets[0];
     17: arrow_data::transform::variable_size::build_extend::{{closure}}::ha20bde03e4f6553e
         at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/variable_size.rs:57
           55 │                 let last_offset = unsafe { get_last_offset(offset_buffer) };
           56 │ 
           57 >                 extend_offsets::<T>(
           58 │                     offset_buffer,
           59 │                     last_offset,
     18: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h6cf66436bc24cef4
         at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/alloc/src/boxed.rs:1949
     19: arrow_data::transform::MutableArrayData::extend::h13a9c5461737bf0f
         at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-data-24.0.0/src/transform/mod.rs:603
          601 │         let len = end - start;
          602 │         (self.extend_null_bits[index])(&mut self.data, start, len);
          603 >         (self.extend_values[index])(&mut self.data, index, start, len);
          604 │         self.data.len += len;
          605 │     }
     20: arrow::compute::kernels::concat::concat::h1efb0d4d54ee8ec9
         at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-24.0.0/src/compute/kernels/concat.rs:100
           98 │ 
           99 │     for (i, len) in lengths.iter().enumerate() {
          100 >         mutable.extend(i, 0, *len)
          101 │     }
          102 │ 
     21: arrow::compute::kernels::concat::concat_batches::ha80396bf3c3d8b6c
         at /home/msalib/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-24.0.0/src/compute/kernels/concat.rs:127
          125 │     let mut arrays = Vec::with_capacity(field_num);
          126 │     for i in 0..field_num {
          127 >         let array = concat(
          128 │             &batches
          129 │                 .iter()
   
   ```
   
   Thanks so much! We rely heavily on arrow+parquet and they've made our lives a lot easier!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #3123: arithmatic overflow leads to segfault in `concat_batches`

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #3123: arithmatic overflow leads to segfault in `concat_batches`
URL: https://github.com/apache/arrow-rs/issues/3123


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3123: arithmatic overflow leads to segfault in `concat_batches`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #3123:
URL: https://github.com/apache/arrow-rs/issues/3123#issuecomment-1327307669

   `label_issue.py` automatically added labels {'arrow'} from #3157


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3123: arithmatic overflow leads to segfault in `concat_batches`

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #3123:
URL: https://github.com/apache/arrow-rs/issues/3123#issuecomment-1317492280

   Thank you for the report and reproducer, I plan to look into this next week unless somebody else gets there first. I suspect MutableArrayData is not doing a checked operation when it should be - this should return an error, or at the very least panic.
   
   As for the use of i32, this is a Javaism that unfortunately made its way into the arrow specification. You could try using LargeUtf8 which uses i64, but you may also want to use smaller batches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] msalib commented on issue #3123: arithmatic overflow leads to segfault in `concat_batches`

Posted by GitBox <gi...@apache.org>.
msalib commented on issue #3123:
URL: https://github.com/apache/arrow-rs/issues/3123#issuecomment-1317341918

   More investigation suggests that this happens when `last_offset = 2147483550` and `lenght = 543`, so we're overflowing an i32, which is an odd choice for an index type....


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org