You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2023/05/05 17:45:23 UTC

[Format] Is it legal to have a struct array with a shorter length than its children?

We allow arrays to have a shorter length than their buffers.  Is it also
legal for a struct array to have a shorter length than its child arrays?
For example, in C++, I can create this today by slicing a struct array:

```
  std::shared_ptr<StructArray> my_array =
std::dynamic_pointer_cast<StructArray>(array);
  ASSERT_EQ(my_array->length(), 4);
  ASSERT_EQ(my_array->field(0)->length(), 4);
  auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
  ASSERT_EQ(sliced->length(), 2);
  // Note: StructArray::field pushes its offset and length into the created
array
  ASSERT_EQ(sliced->field(0)->length(), 2);
  // However, the actually ArrayData objects show the truth
  ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
  // Our validator thinks this is ok
  ASSERT_OK(sliced->ValidateFull());
```

The only reference I can find in the spec is this:

> Struct: a nested layout consisting of a collection of named child fields
each
> having the same length but possibly different types.

This seems to suggest that the C++ implementation is doing something
incorrect.

I'm asking because I've started to encounter some issues relating to
this[1][2] and I'm not sure if the struct array itself is the issue or the
fact that we aren't expecting these kinds of struct arrays is the problem.

[1] https://github.com/apache/arrow/issues/35450
[2] https://github.com/apache/arrow/issues/35452

Re: [Format] Is it legal to have a struct array with a shorter length than its children?

Posted by Micah Kornfield <em...@gmail.com>.
>
> I think the `the same length` means the length of the struct array, this is
> similar in the case of RecordBatch where the `num_rows` of a RecordBatch
> can be different to the length of its fields.


 I didn't think this is true and I thought we had prior discussions on the
mailing list suggesting that all top level arrays should have the same
length in the IPC format.

  ASSERT_EQ(sliced->length(), 2);
>   // Note: StructArray::field pushes its offset and length into the created
> array
>   ASSERT_EQ(sliced->field(0)->length(), 2);

From an IPC perspective this is what I would expect (the buffers themselves
could be longer than needed but they are sliced by the length), IIRC, isn't
ArrayData an internal C++ detail?

Thanks,
Micah



On Sat, May 6, 2023 at 7:34 AM Antoine Pitrou <an...@python.org> wrote:

>
> I agree with Gang. The fact that a struct field may be backed by a
> physically larger C++ ArrayData is irrelevant, as long as it's logically
> interpreted as having "the same length".
>
> (however, this implementation detail should ideally not leak into IPC or
> C Data exports)
>
> Regards
>
> Antoine.
>
>
> Le 06/05/2023 à 03:31, Gang Wu a écrit :
> > IMHO, this is valid. As you have demonstrated in the example, a sliced
> > struct array will result in a length shorter than its child arrays. This
> > kind
> > of flexibility can make it easy to reuse child arrays within the struct
> > array.
> >
> >> Struct: a nested layout consisting of a collection of named child fields
> >> each having the same length but possibly different types.
> >
> > I think the `the same length` means the length of the struct array, this
> is
> > similar in the case of RecordBatch where the `num_rows` of a RecordBatch
> > can be different to the length of its fields.
> >
> > Best,
> > Gang
> >
> >
> > On Sat, May 6, 2023 at 1:45 AM Weston Pace <we...@gmail.com>
> wrote:
> >
> >> We allow arrays to have a shorter length than their buffers.  Is it also
> >> legal for a struct array to have a shorter length than its child arrays?
> >> For example, in C++, I can create this today by slicing a struct array:
> >>
> >> ```
> >>    std::shared_ptr<StructArray> my_array =
> >> std::dynamic_pointer_cast<StructArray>(array);
> >>    ASSERT_EQ(my_array->length(), 4);
> >>    ASSERT_EQ(my_array->field(0)->length(), 4);
> >>    auto sliced =
> std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
> >>    ASSERT_EQ(sliced->length(), 2);
> >>    // Note: StructArray::field pushes its offset and length into the
> created
> >> array
> >>    ASSERT_EQ(sliced->field(0)->length(), 2);
> >>    // However, the actually ArrayData objects show the truth
> >>    ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
> >>    // Our validator thinks this is ok
> >>    ASSERT_OK(sliced->ValidateFull());
> >> ```
> >>
> >> The only reference I can find in the spec is this:
> >>
> >>> Struct: a nested layout consisting of a collection of named child
> fields
> >> each
> >>> having the same length but possibly different types.
> >>
> >> This seems to suggest that the C++ implementation is doing something
> >> incorrect.
> >>
> >> I'm asking because I've started to encounter some issues relating to
> >> this[1][2] and I'm not sure if the struct array itself is the issue or
> the
> >> fact that we aren't expecting these kinds of struct arrays is the
> problem.
> >>
> >> [1] https://github.com/apache/arrow/issues/35450
> >> [2] https://github.com/apache/arrow/issues/35452
> >>
> >
>

Re: [Format] Is it legal to have a struct array with a shorter length than its children?

Posted by Antoine Pitrou <an...@python.org>.
I agree with Gang. The fact that a struct field may be backed by a 
physically larger C++ ArrayData is irrelevant, as long as it's logically 
interpreted as having "the same length".

(however, this implementation detail should ideally not leak into IPC or 
C Data exports)

Regards

Antoine.


Le 06/05/2023 à 03:31, Gang Wu a écrit :
> IMHO, this is valid. As you have demonstrated in the example, a sliced
> struct array will result in a length shorter than its child arrays. This
> kind
> of flexibility can make it easy to reuse child arrays within the struct
> array.
> 
>> Struct: a nested layout consisting of a collection of named child fields
>> each having the same length but possibly different types.
> 
> I think the `the same length` means the length of the struct array, this is
> similar in the case of RecordBatch where the `num_rows` of a RecordBatch
> can be different to the length of its fields.
> 
> Best,
> Gang
> 
> 
> On Sat, May 6, 2023 at 1:45 AM Weston Pace <we...@gmail.com> wrote:
> 
>> We allow arrays to have a shorter length than their buffers.  Is it also
>> legal for a struct array to have a shorter length than its child arrays?
>> For example, in C++, I can create this today by slicing a struct array:
>>
>> ```
>>    std::shared_ptr<StructArray> my_array =
>> std::dynamic_pointer_cast<StructArray>(array);
>>    ASSERT_EQ(my_array->length(), 4);
>>    ASSERT_EQ(my_array->field(0)->length(), 4);
>>    auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
>>    ASSERT_EQ(sliced->length(), 2);
>>    // Note: StructArray::field pushes its offset and length into the created
>> array
>>    ASSERT_EQ(sliced->field(0)->length(), 2);
>>    // However, the actually ArrayData objects show the truth
>>    ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
>>    // Our validator thinks this is ok
>>    ASSERT_OK(sliced->ValidateFull());
>> ```
>>
>> The only reference I can find in the spec is this:
>>
>>> Struct: a nested layout consisting of a collection of named child fields
>> each
>>> having the same length but possibly different types.
>>
>> This seems to suggest that the C++ implementation is doing something
>> incorrect.
>>
>> I'm asking because I've started to encounter some issues relating to
>> this[1][2] and I'm not sure if the struct array itself is the issue or the
>> fact that we aren't expecting these kinds of struct arrays is the problem.
>>
>> [1] https://github.com/apache/arrow/issues/35450
>> [2] https://github.com/apache/arrow/issues/35452
>>
> 

Re: [Format] Is it legal to have a struct array with a shorter length than its children?

Posted by Gang Wu <us...@gmail.com>.
IMHO, this is valid. As you have demonstrated in the example, a sliced
struct array will result in a length shorter than its child arrays. This
kind
of flexibility can make it easy to reuse child arrays within the struct
array.

> Struct: a nested layout consisting of a collection of named child fields
> each having the same length but possibly different types.

I think the `the same length` means the length of the struct array, this is
similar in the case of RecordBatch where the `num_rows` of a RecordBatch
can be different to the length of its fields.

Best,
Gang


On Sat, May 6, 2023 at 1:45 AM Weston Pace <we...@gmail.com> wrote:

> We allow arrays to have a shorter length than their buffers.  Is it also
> legal for a struct array to have a shorter length than its child arrays?
> For example, in C++, I can create this today by slicing a struct array:
>
> ```
>   std::shared_ptr<StructArray> my_array =
> std::dynamic_pointer_cast<StructArray>(array);
>   ASSERT_EQ(my_array->length(), 4);
>   ASSERT_EQ(my_array->field(0)->length(), 4);
>   auto sliced = std::dynamic_pointer_cast<StructArray>(my_array->Slice(2));
>   ASSERT_EQ(sliced->length(), 2);
>   // Note: StructArray::field pushes its offset and length into the created
> array
>   ASSERT_EQ(sliced->field(0)->length(), 2);
>   // However, the actually ArrayData objects show the truth
>   ASSERT_EQ(sliced->data()->child_data[0]->length, 4);
>   // Our validator thinks this is ok
>   ASSERT_OK(sliced->ValidateFull());
> ```
>
> The only reference I can find in the spec is this:
>
> > Struct: a nested layout consisting of a collection of named child fields
> each
> > having the same length but possibly different types.
>
> This seems to suggest that the C++ implementation is doing something
> incorrect.
>
> I'm asking because I've started to encounter some issues relating to
> this[1][2] and I'm not sure if the struct array itself is the issue or the
> fact that we aren't expecting these kinds of struct arrays is the problem.
>
> [1] https://github.com/apache/arrow/issues/35450
> [2] https://github.com/apache/arrow/issues/35452
>