You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/02 02:20:27 UTC

[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144347783

   Hi @tustvold 
   Here are some thoughts of mine:
   1. (Agree with you) We have to preserve `offset` on parent ArrayData (the struct array) because of the validity buffer.
   2. >Slice child data within StructArray when constructing boxed_fields
   If what you mean is to slice child arrays in `StructArray::from`:
   ```rust
       fn from(data: ArrayData) -> Self {
           let boxed_fields = data
               .child_data()
               .iter()
               .map(|cd| make_array(cd.slice(data.offset(), data.len())))
               .collect();
   
           Self { data, boxed_fields }
       }
   ```
   I am afraid this cannot solve the problem. Because
   - Inconsistent offsets still exist. For example, given a struct array:
   ```
   StructArray (offset = 0, length = 5, field0 = i32, field1 = bool, validity_buffer = ...)
   -- Int32Array (offset = 1, length = 5)
   -- BooleanArray (offset = 2, length = 5)
   ```
   After calling `array.slice(offset = 1, length = 3)`, we will get a struct array like this
   ```
   StructArray (offset = 1, length = 3, field0 = i32, field1 = bool, validity_buffer = ...)
   -- Int32Array (offset = 2, length = 3)
   -- BooleanArray (offset = 3, length = 3)
   ```
   We still have 3 `offset`s in parent array and child arrays. And it is not clear whether we have pushed down the new offset to children.
   3. My suggestion (workaround)
   - Do not push down the offset to child arrays when slicing (aka, just update the offset and null_count of parent array)
   - Do not slicing child arrays when building `boxed_fields`. Because this will produce inconsistent offsets
   - Slice child arrays when you call `array.column(idx)`. This is also the way of getting a buffer of array now:
   Example: getting the offsets of binary array.
   ```rust
       pub fn value_offsets(&self) -> &[OffsetSize] {
           unsafe {
               std::slice::from_raw_parts(
                   self.value_offsets.as_ptr().add(self.data.offset()),
                   self.len() + 1,
               )
           }
       }
   ```
   Slice when getting the child array. (API change, because we own the `ArrayRef` now)
   ```rust
       pub fn column(&self, pos: usize) -> ArrayRef {
           self.boxed_fields[pos].slice(self.offset, self.length)
       }
   ```
   - Long term suggestion: push down `offset` and `length` to `Buffer`(`Bitmap`).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org