You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/03/16 22:11:30 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #1750: Inconsistent Slicing of ArrayData for StructArray

tustvold opened a new issue, #1750:
URL: https://github.com/apache/arrow-rs/issues/1750

   **Problem**
   
   ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.
   
   The result is the offset has been pushed down to the child arrays, but is still present on the parent ArrayData. This makes it unclear if an offset should or should not be applied to the child_array.
   
   **Proposal**
   
   There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:
   
   * Remove the ArrayData::slice special-case
   * Slice child data within `StructArray` when constructing `boxed_fields`
   
   This is consistent with how we handle offsets elsewhere, with ArrayData a dumb container, and the `Array` providing the logic to interpret the offset correctly.
   
   Thoughts @nevi-me 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144479163

   > So in StructArray::from it would inspect the ArrayData, compute slices, and store those as members.
   Implement as this:
   ```rust
       fn from(data: ArrayData) -> Self {
           let boxed_fields = data
               .child_data()
               .iter()
               .map(|cd| make_array(cd.slice(data.offset(), data.len())))
               .collect();
   
           Self { data, boxed_fields }
       }
   ```
   ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144487888

   I don't find `StructArray::slice` in the c++ implementation (I just find the ArrayData::slice https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/data.cc#L94-L110)
   And in Arrow2, everything is pushed down to Buffer and Bitmap


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144446613

   > And it is not clear whether we have pushed the new offset to children
   
   The suggestion is never to push offsets down within ArrayData, and so you should never have this ambiguity? 
   
   When constructing an instance of StructArray you "hydrate" the offsets to boxed fields to avoid having to recompute slices on access, but this does not modify the underlying ArrayData?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1148656595

   Sorry for being late to this discussion, the proposed change makes sense to me. If I understand it correctly, this would be a silent behavior change of the `column`/`columns`/`columns_ref` methods on `StructArray`, which previously did not take the struct offset into account. We might want to deprecate those and instead add new methods like `pub fn column_slice(&self, pos: usize) -> &ArrayRef` (exact name up for discussion). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144482742

   I _think_ that should work, but there could be some arcane edge case I've forgotten.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144494306

   >  But It will cause the inconsistency between bixed_fields and self.data.child_data, which means StructArray::column(self, 0) != make_array(self.data.child_data()[0])?
   
   We have the same "inconsistency" on other array types, e.g. `GenericStringArray::value_offsets` or `PrimitiveArray::values`. I personally don't think this is a problem.
   
   > I don't find StructArray::slice in the c++ implementation
   
   Yeah this is just a usability thing, the Rust implementation is just calling through to `make_array(self.data_ref().slice(offset, length))`.
   
   > everything is pushed down to Buffer and Bitmap
   
   I would definitely support this change, and it would be the better long-term solution. It has been discussed at length, all that really remains is someone to volunteer to do it :sweat_smile:. The only outstanding sticking point has been handling offsets for BooleanArrays, but I don't imagine there would be resistance to special casing these as a Bitmap in ArrayData...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1750: Inconsistent Slicing of ArrayData for StructArray
URL: https://github.com/apache/arrow-rs/issues/1750


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144716571

   cc @jhorstmann and @bjchambers who I think have interest in this area


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] nevi-me commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1153269508

   I've read this thread a few times, but I'm still hazy on what a good approach is, given how out of the loop I have been for so long. We had discussed with Jorge many moons ago that passing the offset and length to Buffer and Bitmap would be a good solution (as is done in arrow2 like you mention @tustvold).
   
   I haven't written arrow code in very long, so I can't quite remember the details. However, what I recall was having a challenge figuring out what happens in the below scenario.
   
   An array is of type `struct[a]<struct[b]<struct[c]<struct[d]<int32[e]>>>>` and we slice it, what happens when we select `a.b.c`?
   
   The trouble was that if we don't pass down the offset and length to the `ArrayData` of `a`'s children, we'd be bound to always knowing `a`'s offset, which forces us to compute it each time we access `a` or any of its children.
   
   So in principle I favoured pushing down the offset at the time. Which I suppose has led us here:
   
   > ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.
   
   ___
   
   >There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:
   > * Remove the ArrayData::slice special-case
   > * Slice child data within StructArray when constructing boxed_fields
   
   This makes sense to implement as a solution (interim?), but yea perhaps first-prize would be propagating offsets and value lengths to a redesigned Buffer and Bitmap
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144475684

   Aah, I see the confusion. I'm suggesting storing slices of the child data as boxed_fields on StructArray. So in StructArray::from it would inspect the ArrayData, compute slices, and store those as members. I am not suggesting StructArray::from make any modifications to ArrayData.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1149087530

   > > which previously did not take the struct offset into account.
   > 
   > Is this not a bug?
   
   Difficult to say without a spec or comment describing the expected behavior. Someone might rely on the current behavior and acccess a StructArray using something like `a.column(0).downcast().value(a.offset() + i)`. I didn't see any such usages in the arrow-rs codebase. Our codebase makes heavy use of slices, but we do not use StructArray. If we consider the current behavior a bug, then we should clearly hightlight the change in the release notes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #1750: Inconsistent Slicing of ArrayData for StructArray
URL: https://github.com/apache/arrow-rs/issues/1750


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144484557

   This will not modify the parent ArrayData, which is great. 
   But It will cause the inconsistency between `bixed_fields` and `self.data.child_data`, which means `StructArray::column(self, 0) !=  make_array(self.data.child_data()[0])`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1149061109

   > which previously did not take the struct offset into account.
   
   Is this not a bug?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1186004486

   Closing in favor of removing the offsets entirely (#1799)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1523620281

   Closed by #4061 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144347783

   Hi @tustvold 
   Here are some thoughts of mine:
   1. (Agree with you) We have to preserve `offset` on parent ArrayData (the struct array) because of the validity buffer.
   2. >Slice child data within StructArray when constructing boxed_fields
   If what you mean is to slice child arrays in `StructArray::from`:
   ```rust
       fn from(data: ArrayData) -> Self {
           let boxed_fields = data
               .child_data()
               .iter()
               .map(|cd| make_array(cd.slice(data.offset(), data.len())))
               .collect();
   
           Self { data, boxed_fields }
       }
   ```
   I am afraid this cannot solve the problem. Because
   - Inconsistent offsets still exist. For example, given a struct array:
   ```
   StructArray (offset = 0, length = 5, field0 = i32, field1 = bool, validity_buffer = ...)
   -- Int32Array (offset = 1, length = 5)
   -- BooleanArray (offset = 2, length = 5)
   ```
   After calling `array.slice(offset = 1, length = 3)`, we will get a struct array like this
   ```
   StructArray (offset = 1, length = 3, field0 = i32, field1 = bool, validity_buffer = ...)
   -- Int32Array (offset = 2, length = 3)
   -- BooleanArray (offset = 3, length = 3)
   ```
   We still have 3 `offset`s in parent array and child arrays. And it is not clear whether we have pushed down the new offset to children.
   3. My suggestion (workaround)
   - Do not push down the offset to child arrays when slicing (aka, just update the offset and null_count of parent array)
   - Do not slicing child arrays when building `boxed_fields`. Because this will produce inconsistent offsets
   - Slice child arrays when you call `array.column(idx)`. This is also the way of getting a buffer of array now:
   Example: getting the offsets of binary array.
   ```rust
       pub fn value_offsets(&self) -> &[OffsetSize] {
           unsafe {
               std::slice::from_raw_parts(
                   self.value_offsets.as_ptr().add(self.data.offset()),
                   self.len() + 1,
               )
           }
       }
   ```
   Slice when getting the child array. (API change, because we own the `ArrayRef` now)
   ```rust
       pub fn column(&self, pos: usize) -> ArrayRef {
           self.boxed_fields[pos].slice(self.offset, self.length)
       }
   ```
   - Long term suggestion: push down `offset` and `length` to `Buffer`(`Bitmap`).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144348957

   cc @nevi-me @alamb @viirya


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1144471150

   >The suggestion is never to push offsets down within ArrayData
   
   Although we don't push down the offset within `ArrayData::slice` (Based on your 1st suggestion `Remove the ArrayData::slice special-case`), we do push it down within `StructArray::from` (Based on your 2nd suggestion `Slice child data within StructArray when constructing boxed_fields`) because we must slice on each `child_data` to construct the correct child array.
   
   As a result, I guess your suggestions (slicing child arrays within `StructArray::from()`) and the current implementation (slicing the child data within `ArrayData::slice()`) are equivalent. I am not 100% sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1750: Inconsistent Slicing of ArrayData for StructArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1750:
URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1153858639

   > The trouble was that if we don't pass down the offset and length to the ArrayData of a's children, we'd be bound to always knowing a's offset, which forces us to compute it each time we access a or any of its children.
   
   I think caching the sliced array on `StructArray::boxed_fields` would avoid this, and is consistent with what we do for PrimtiveArray / StringArray, etc... ?
   
   > but yea perhaps first-prize would be propagating offsets and value lengths to a redesigned Buffer and Bitmap
   
   Agreed, and I think the first step of this is removing the remaining places that "leak" offsets into the Array-level interfaces :smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org