You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2023/05/05 17:45:23 UTC
[Format] Is it legal to have a struct array with a shorter length than its children?
We allow arrays to have a shorter length than their buffers. Is it also
legal for a struct array to have a shorter length than its child arrays?
For example, in C++, I can create this today by slicing a struct array:
```
std::shared_ptr<StructArray> my_array =
std::dynamic_pointer_cast<StructArray>(array);
ASSERT_EQ(my_array->length(), 4);
ASSERT_EQ(my_array->field(0)->length(), 4);
auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
ASSERT_EQ(sliced->length(), 2);
// Note: StructArray::field pushes its offset and length into the created
array
ASSERT_EQ(sliced->field(0)->length(), 2);
// However, the actually ArrayData objects show the truth
ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
// Our validator thinks this is ok
ASSERT_OK(sliced->ValidateFull());
```
The only reference I can find in the spec is this:
> Struct: a nested layout consisting of a collection of named child fields
each
> having the same length but possibly different types.
This seems to suggest that the C++ implementation is doing something
incorrect.
I'm asking because I've started to encounter some issues relating to
this[1][2] and I'm not sure if the struct array itself is the issue or the
fact that we aren't expecting these kinds of struct arrays is the problem.
[1] https://github.com/apache/arrow/issues/35450
[2] https://github.com/apache/arrow/issues/35452
Re: [Format] Is it legal to have a struct array with a shorter length than its children?
Posted by Micah Kornfield <em...@gmail.com>.
>
> I think the `the same length` means the length of the struct array, this is
> similar in the case of RecordBatch where the `num_rows` of a RecordBatch
> can be different to the length of its fields.
I didn't think this is true and I thought we had prior discussions on the
mailing list suggesting that all top level arrays should have the same
length in the IPC format.
ASSERT_EQ(sliced->length(), 2);
> // Note: StructArray::field pushes its offset and length into the created
> array
> ASSERT_EQ(sliced->field(0)->length(), 2);
From an IPC perspective this is what I would expect (the buffers themselves
could be longer than needed but they are sliced by the length), IIRC, isn't
ArrayData an internal C++ detail?
Thanks,
Micah
On Sat, May 6, 2023 at 7:34 AM Antoine Pitrou <an...@python.org> wrote:
>
> I agree with Gang. The fact that a struct field may be backed by a
> physically larger C++ ArrayData is irrelevant, as long as it's logically
> interpreted as having "the same length".
>
> (however, this implementation detail should ideally not leak into IPC or
> C Data exports)
>
> Regards
>
> Antoine.
>
>
> Le 06/05/2023 à 03:31, Gang Wu a écrit :
> > IMHO, this is valid. As you have demonstrated in the example, a sliced
> > struct array will result in a length shorter than its child arrays. This
> > kind
> > of flexibility can make it easy to reuse child arrays within the struct
> > array.
> >
> >> Struct: a nested layout consisting of a collection of named child fields
> >> each having the same length but possibly different types.
> >
> > I think the `the same length` means the length of the struct array, this
> is
> > similar in the case of RecordBatch where the `num_rows` of a RecordBatch
> > can be different to the length of its fields.
> >
> > Best,
> > Gang
> >
> >
> > On Sat, May 6, 2023 at 1:45 AM Weston Pace <we...@gmail.com>
> wrote:
> >
> >> We allow arrays to have a shorter length than their buffers. Is it also
> >> legal for a struct array to have a shorter length than its child arrays?
> >> For example, in C++, I can create this today by slicing a struct array:
> >>
> >> ```
> >> std::shared_ptr<StructArray> my_array =
> >> std::dynamic_pointer_cast<StructArray>(array);
> >> ASSERT_EQ(my_array->length(), 4);
> >> ASSERT_EQ(my_array->field(0)->length(), 4);
> >> auto sliced =
> std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
> >> ASSERT_EQ(sliced->length(), 2);
> >> // Note: StructArray::field pushes its offset and length into the
> created
> >> array
> >> ASSERT_EQ(sliced->field(0)->length(), 2);
> >> // However, the actually ArrayData objects show the truth
> >> ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
> >> // Our validator thinks this is ok
> >> ASSERT_OK(sliced->ValidateFull());
> >> ```
> >>
> >> The only reference I can find in the spec is this:
> >>
> >>> Struct: a nested layout consisting of a collection of named child
> fields
> >> each
> >>> having the same length but possibly different types.
> >>
> >> This seems to suggest that the C++ implementation is doing something
> >> incorrect.
> >>
> >> I'm asking because I've started to encounter some issues relating to
> >> this[1][2] and I'm not sure if the struct array itself is the issue or
> the
> >> fact that we aren't expecting these kinds of struct arrays is the
> problem.
> >>
> >> [1] https://github.com/apache/arrow/issues/35450
> >> [2] https://github.com/apache/arrow/issues/35452
> >>
> >
>
Re: [Format] Is it legal to have a struct array with a shorter length than its children?
Posted by Antoine Pitrou <an...@python.org>.
I agree with Gang. The fact that a struct field may be backed by a
physically larger C++ ArrayData is irrelevant, as long as it's logically
interpreted as having "the same length".
(however, this implementation detail should ideally not leak into IPC or
C Data exports)
Regards
Antoine.
Le 06/05/2023 à 03:31, Gang Wu a écrit :
> IMHO, this is valid. As you have demonstrated in the example, a sliced
> struct array will result in a length shorter than its child arrays. This
> kind
> of flexibility can make it easy to reuse child arrays within the struct
> array.
>
>> Struct: a nested layout consisting of a collection of named child fields
>> each having the same length but possibly different types.
>
> I think the `the same length` means the length of the struct array, this is
> similar in the case of RecordBatch where the `num_rows` of a RecordBatch
> can be different to the length of its fields.
>
> Best,
> Gang
>
>
> On Sat, May 6, 2023 at 1:45 AM Weston Pace <we...@gmail.com> wrote:
>
>> We allow arrays to have a shorter length than their buffers. Is it also
>> legal for a struct array to have a shorter length than its child arrays?
>> For example, in C++, I can create this today by slicing a struct array:
>>
>> ```
>> std::shared_ptr<StructArray> my_array =
>> std::dynamic_pointer_cast<StructArray>(array);
>> ASSERT_EQ(my_array->length(), 4);
>> ASSERT_EQ(my_array->field(0)->length(), 4);
>> auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
>> ASSERT_EQ(sliced->length(), 2);
>> // Note: StructArray::field pushes its offset and length into the created
>> array
>> ASSERT_EQ(sliced->field(0)->length(), 2);
>> // However, the actually ArrayData objects show the truth
>> ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
>> // Our validator thinks this is ok
>> ASSERT_OK(sliced->ValidateFull());
>> ```
>>
>> The only reference I can find in the spec is this:
>>
>>> Struct: a nested layout consisting of a collection of named child fields
>> each
>>> having the same length but possibly different types.
>>
>> This seems to suggest that the C++ implementation is doing something
>> incorrect.
>>
>> I'm asking because I've started to encounter some issues relating to
>> this[1][2] and I'm not sure if the struct array itself is the issue or the
>> fact that we aren't expecting these kinds of struct arrays is the problem.
>>
>> [1] https://github.com/apache/arrow/issues/35450
>> [2] https://github.com/apache/arrow/issues/35452
>>
>
Re: [Format] Is it legal to have a struct array with a shorter length than its children?
Posted by Gang Wu <us...@gmail.com>.
IMHO, this is valid. As you have demonstrated in the example, a sliced
struct array will result in a length shorter than its child arrays. This
kind
of flexibility can make it easy to reuse child arrays within the struct
array.
> Struct: a nested layout consisting of a collection of named child fields
> each having the same length but possibly different types.
I think the `the same length` means the length of the struct array, this is
similar in the case of RecordBatch where the `num_rows` of a RecordBatch
can be different to the length of its fields.
Best,
Gang
On Sat, May 6, 2023 at 1:45 AM Weston Pace <we...@gmail.com> wrote:
> We allow arrays to have a shorter length than their buffers. Is it also
> legal for a struct array to have a shorter length than its child arrays?
> For example, in C++, I can create this today by slicing a struct array:
>
> ```
> std::shared_ptr<StructArray> my_array =
> std::dynamic_pointer_cast<StructArray>(array);
> ASSERT_EQ(my_array->length(), 4);
> ASSERT_EQ(my_array->field(0)->length(), 4);
> auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
> ASSERT_EQ(sliced->length(), 2);
> // Note: StructArray::field pushes its offset and length into the created
> array
> ASSERT_EQ(sliced->field(0)->length(), 2);
> // However, the actually ArrayData objects show the truth
> ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
> // Our validator thinks this is ok
> ASSERT_OK(sliced->ValidateFull());
> ```
>
> The only reference I can find in the spec is this:
>
> > Struct: a nested layout consisting of a collection of named child fields
> each
> > having the same length but possibly different types.
>
> This seems to suggest that the C++ implementation is doing something
> incorrect.
>
> I'm asking because I've started to encounter some issues relating to
> this[1][2] and I'm not sure if the struct array itself is the issue or the
> fact that we aren't expecting these kinds of struct arrays is the problem.
>
> [1] https://github.com/apache/arrow/issues/35450
> [2] https://github.com/apache/arrow/issues/35452
>