Posted to jira@arrow.apache.org by "Yordan Pavlov (Jira)" <ji...@apache.org> on 2021/02/26 20:12:00 UTC
[jira] [Created] (ARROW-11799) [Rust] String and Binary arrays created with incorrect length from unbound iterator
Yordan Pavlov created ARROW-11799:
-------------------------------------
Summary: [Rust] String and Binary arrays created with incorrect length from unbound iterator
Key: ARROW-11799
URL: https://issues.apache.org/jira/browse/ARROW-11799
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Affects Versions: 3.0.0
Reporter: Yordan Pavlov
Assignee: Yordan Pavlov
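While looking for a way to make loading array data from Parquet files faster, I stumbled on an edge case: string and binary arrays are created with an incorrect length when they are collected from an iterator that has no exact upper size bound, that is, one whose size hint can be larger than the number of items it actually yields.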
Here is a simple example:
```
use arrow::array::{Array, StringArray};

fn main() {
    // iterator that doesn't declare an (upper) size bound
    let string_iter = (0..)
        .scan(0usize, |pos, i| {
            if *pos < 10 {
                *pos += 1;
                Some(Some(format!("value {}", i)))
            } else {
                // stops here, so at most 10 values are ever returned
                None
            }
        })
        // limited using take(), which only caps the declared upper bound
        .take(100);
    let (lower_size_bound, upper_size_bound) = string_iter.size_hint();
    assert_eq!(lower_size_bound, 0);
    // the upper bound, defined by take above, is 100
    assert_eq!(upper_size_bound, Some(100));
    let string_array: StringArray = string_iter.collect();
    // but only 10 items are actually yielded, so the expected array length is 10
    assert_eq!(string_array.len(), 10);
}
```
Fortunately, this is easy to fix by using the length of the child offset array, and I will be creating a PR for this shortly.
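To illustrate the idea, here is a minimal std-only sketch, not the actual arrow code. `SimpleStringArray` and `collect_strings` are hypothetical names made up for this example, and validity is kept in a plain `Vec<bool>` rather than a bitmap. The size hint is used only to pre-allocate, and the length is derived from the offsets that were actually written:

```rust
/// Hypothetical, simplified stand-in for a string array (not arrow's type):
/// item `i` occupies `values[offsets[i]..offsets[i + 1]]`.
pub struct SimpleStringArray {
    offsets: Vec<usize>,
    values: Vec<u8>,
    validity: Vec<bool>,
}

impl SimpleStringArray {
    /// The length comes from the offsets actually written,
    /// never from the iterator's size hint.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns the string at index `i`, or `None` if the slot is null.
    pub fn value(&self, i: usize) -> Option<&str> {
        if !self.validity[i] {
            return None;
        }
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        Some(std::str::from_utf8(bytes).expect("values were built from valid UTF-8"))
    }
}

/// Collects optional strings into offsets + values buffers.
pub fn collect_strings<I, S>(iter: I) -> SimpleStringArray
where
    I: IntoIterator<Item = Option<S>>,
    S: AsRef<str>,
{
    let iter = iter.into_iter();
    // The size hint is only an allocation hint here; the lower bound is the
    // only part that is guaranteed not to overestimate.
    let (lower, _upper) = iter.size_hint();
    let mut offsets = Vec::with_capacity(lower.saturating_add(1));
    let mut validity = Vec::with_capacity(lower);
    let mut values = Vec::new();

    // The first offset is always 0; one more offset is pushed per item.
    offsets.push(0);
    for item in iter {
        match item {
            Some(s) => {
                values.extend_from_slice(s.as_ref().as_bytes());
                validity.push(true);
            }
            // A null slot adds no bytes, so its end offset repeats the previous one.
            None => validity.push(false),
        }
        offsets.push(values.len());
    }

    SimpleStringArray { offsets, values, validity }
}
```

With the iterator from the example above, `collect_strings(string_iter).len()` is 10 even though the declared upper bound is 100, because exactly one offset is appended per item that the iterator really produces.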