You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/02 02:20:27 UTC
[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144347783
Hi @tustvold
Here are some thoughts of mine:
1. (Agree with you) We have to preserve `offset` on parent ArrayData (the struct array) because of the validity buffer.
2. >Slice child data within StructArray when constructing boxed_fields
If what you mean is to slice child arrays in `StructArray::from`:
```rust
fn from(data: ArrayData) -> Self {
let boxed_fields = data
.child_data()
.iter()
.map(|cd| make_array(cd.slice(data.offset(), data.len())))
.collect();
Self { data, boxed_fields }
}
```
I am afraid this cannot solve the problem. Because
- Inconsistent offsets still exist. For example, given a struct array:
```
StructArray (offset = 0, length = 5, field0 = i32, field1 = bool, validity_buffer = ...)
-- Int32Array (offset = 1, length = 5)
-- BooleanArray (offset = 2, length = 5)
```
After calling `array.slice(offset = 1, length = 3)`, we will get a struct array like this
```
StructArray (offset = 1, length = 3, field0 = i32, field1 = bool, validity_buffer = ...)
-- Int32Array (offset = 2, length = 3)
-- BooleanArray (offset = 3, length = 3)
```
We still have 3 `offset`s in parent array and child arrays. And it is not clear whether we have pushed down the new offset to children.
3. My suggestion (workaround)
- Do not push down the offset to child arrays when slicing (aka, just update the offset and null_count of parent array)
- Do not slicing child arrays when building `boxed_fields`. Because this will produce inconsistent offsets
- Slice child arrays when you call `array.column(idx)`. This is also the way of getting a buffer of array now:
Example: getting the offsets of binary array.
```rust
pub fn value_offsets(&self) -> &[OffsetSize] {
unsafe {
std::slice::from_raw_parts(
self.value_offsets.as_ptr().add(self.data.offset()),
self.len() + 1,
)
}
}
```
Slice when getting the child array. (API change, because we own the `ArrayRef` now)
```rust
pub fn column(&self, pos: usize) -> ArrayRef {
self.boxed_fields[pos].slice(self.offset, self.length)
}
```
- Long term suggestion: push down `offset` and `length` to `Buffer`(`Bitmap`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org