You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 12:49:16 UTC

[GitHub] [arrow-rs] alamb opened a new issue #197: String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior

alamb opened a new issue #197:
URL: https://github.com/apache/arrow-rs/issues/197


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11862
   
   As [~jorgecarleitao] says on  https://github.com/apache/arrow/pull/9588#discussion_r584290701
   
   The (Rust) Iterator spec recommends, but does not require, that the iterator reports a correct length. Consumer that lead to undefined behavior from an incorrect size_hint are the causers of said undefined behavior.
   
   The only case where consumers can trust the iterators' length is when the interator implement unsafe trait TrustedLen. Unfortunately, TrustedLen is still in unstable. For that reason, we have been exposing unsafe Buffer::from_trusted_len_iter and the like for those cases.
   
   So the code should be updated to handle the case where the reported `size_hint` turns out to be incorrect


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #197: String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #197:
URL: https://github.com/apache/arrow-rs/issues/197#issuecomment-1010277718


   > The only thing remaining for me to do on this issue is to verify the uses of the unsafe `from_trusted_iter` calls.
   
   As a final follow up, I did verify the uses of `from_trusted_iter` in the crate and concluded they are all safe


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb closed issue #197: String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb closed issue #197:
URL: https://github.com/apache/arrow-rs/issues/197


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #197: String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #197:
URL: https://github.com/apache/arrow-rs/issues/197#issuecomment-1008319749


   An update: manually audited all the uses of the  `size_hint` in the arrow codebase:
   
   ```shell
   rg -n -H --no-heading -e 'size_hint' $(git rev-parse --show-toplevel || pwd)
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/immutable.rs:308:                let (lower, _) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:387:        let (lower, _) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:433:        let (_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:475:        let (_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:524:        let (_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:621:            let byte_capacity: usize = iterator.size_hint().0.saturating_add(7) / 8;
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:653:                    iterator.size_hint().0.saturating_add(7) / 8, //convert bit count to byte count, rounding up
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:338:        let (lower, _) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:376:    /// I.e. that `size_hint().1` correctly reports its length.
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:384:        let (_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:959:        let (_, upper_size_bound) = value_iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_list.rs:153:        let (lower, _) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_binary.rs:236:        let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_binary.rs:1125:        let (_, upper_size_bound) = value_iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/builder.rs:271:            .size_hint()
   /Users/alamb/Software/arrow-rs/arrow/src/array/builder.rs:777:            .size_hint()
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:68:    fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:140:    fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:214:    fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:293:    fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:370:    fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/util/trusted_len.rs:36:    let (_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/util/bit_chunk_iterator.rs:160:    fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_boolean.rs:197:        let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_dictionary.rs:179:        let (lower, _) = it.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_dictionary.rs:221:        let (lower, _) = it.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:174:        let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:221:        let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:548:        let (_, upper_size_bound) = string_iter.size_hint();
   ```
   
   My analysis is that any place where `upper` is used to size a buffer, it is either: 
   1. Advisory for performance to allocate capacity (like `Vec::with_capacity`) but not required for correctness
   2. Used in an `unsafe` block which clearly states it is undefined behavior to call the function with an iterator that does not correctly report its size
   3. I did find one issue: https://github.com/apache/arrow-rs/issues/1144 
   
   For 2. above, the`unsafe` code will panic if the iterator's size was inaccurate (potentially after writing past valid memory) as an additional safeguard
   
   The only thing remaining for me to do on this issue is to verify the uses of the unsafe `from_trusted_iter` calls.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org