Posted to jira@arrow.apache.org by "Yordan Pavlov (Jira)" <ji...@apache.org> on 2021/02/26 20:12:00 UTC

[jira] [Created] (ARROW-11799) [Rust] String and Binary arrays created with incorrect length from unbound iterator

Yordan Pavlov created ARROW-11799:
-------------------------------------

             Summary: [Rust] String and Binary arrays created with incorrect length from unbound iterator
                 Key: ARROW-11799
                 URL: https://issues.apache.org/jira/browse/ARROW-11799
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
    Affects Versions: 3.0.0
            Reporter: Yordan Pavlov
            Assignee: Yordan Pavlov


While looking for a way to make loading array data from parquet files faster, I stumbled on an edge case where string and binary arrays are created with an incorrect length from an iterator with no upper bound.

Here is a simple example:

```
use arrow::array::{Array, StringArray};

fn main() {
    // iterator that doesn't declare an (upper) size bound
    let string_iter = (0..)
        .scan(0usize, |pos, i| {
            if *pos < 10 {
                *pos += 1;
                Some(Some(format!("value {}", i)))
            } else {
                // actually returns only up to 10 values
                None
            }
        })
        // limited using take()
        .take(100);

    let (lower_size_bound, upper_size_bound) = string_iter.size_hint();
    assert_eq!(lower_size_bound, 0);
    // the upper bound, defined by take() above, is 100
    assert_eq!(upper_size_bound, Some(100));
    let string_array: StringArray = string_iter.collect();
    // but the iterator yields only 10 items, so the array length should be 10
    assert_eq!(string_array.len(), 10);
}
```

Fortunately, this is easy to fix by using the length of the child offset array; I will be creating a PR for this shortly.
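
To illustrate the idea, here is a minimal, standard-library-only sketch; it is not the actual Arrow implementation or the eventual PR. The `StringData` struct and the `collect_strings` and `example_iter` functions are hypothetical names introduced only for this example. The sketch assumes the usual variable-length layout (a flat value buffer plus an offsets vector with one more entry than there are items) and derives the length from the offsets, using `size_hint()` only as a pre-allocation hint.

```rust
/// Hypothetical stand-in for a variable-length string array (not an Arrow type):
/// a flat value buffer, an offsets vector with `len + 1` entries, and validity flags.
struct StringData {
    values: Vec<u8>,
    offsets: Vec<i32>,
    validity: Vec<bool>,
}

impl StringData {
    /// Number of items, derived from the offsets rather than from a size hint.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns the string at index `i`, or `None` if that slot is null.
    fn value(&self, i: usize) -> Option<&str> {
        if !self.validity[i] {
            return None;
        }
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        // The bytes were copied from `&str` values, so they are valid UTF-8.
        Some(std::str::from_utf8(&self.values[start..end]).expect("valid UTF-8"))
    }
}

/// Hypothetical collector: trusts only the items the iterator actually yields.
fn collect_strings<I, S>(iter: I) -> StringData
where
    I: IntoIterator<Item = Option<S>>,
    S: AsRef<str>,
{
    let iter = iter.into_iter();
    // The size hint is used only to pre-allocate, never as the final length.
    let (lower, _) = iter.size_hint();
    let mut offsets: Vec<i32> = Vec::with_capacity(lower + 1);
    let mut validity: Vec<bool> = Vec::with_capacity(lower);
    let mut values: Vec<u8> = Vec::new();
    offsets.push(0);
    for item in iter {
        match item {
            Some(s) => {
                values.extend_from_slice(s.as_ref().as_bytes());
                validity.push(true);
            }
            None => validity.push(false),
        }
        // One offset is appended per item, including nulls.
        offsets.push(values.len() as i32);
    }
    StringData { values, offsets, validity }
}

/// The iterator from the example above: upper bound of 100, but only 10 items.
fn example_iter() -> impl Iterator<Item = Option<String>> {
    (0..)
        .scan(0usize, |pos, i| {
            if *pos < 10 {
                *pos += 1;
                Some(Some(format!("value {}", i)))
            } else {
                None
            }
        })
        .take(100)
}
```

Calling `collect_strings(example_iter())` produces a `StringData` whose `len()` reflects the items actually consumed, regardless of what `size_hint()` reported.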



--
This message was sent by Atlassian Jira
(v8.3.4#803005)