Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/11/21 15:51:16 UTC

Unions: storing type_ids or type_codes?

Hello,

There's some ambiguity about whether a union array's "types" buffer stores
physical child ids or logical type codes.

Some of our C++ tests assume the former:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123

Some of our C++ tests assume the latter:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955

Critically, no validation of union data is currently implemented in C++
(ARROW-6157).  I can't parse the Java source code.
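
As a rough illustration of one check such validation could include, here is
a minimal Python sketch using plain lists. It is not the C++ implementation
tracked in ARROW-6157; the function name `validate_types_buffer` and its
arguments are hypothetical, and it assumes the "types" buffer is meant to
hold logical type codes.

```python
# Illustrative model only: plain Python, no Arrow library involved.
def validate_types_buffer(types_buffer, type_codes):
    """Return the positions whose value is not a declared type code."""
    allowed = set(type_codes)
    return [i for i, code in enumerate(types_buffer) if code not in allowed]


# A buffer filled with physical child ids is flagged wherever an id
# does not happen to coincide with a declared type code.
bad_positions = validate_types_buffer([0, 1, 2], type_codes=[0, 5, 10])
print(bad_positions)
```

Note that a check like this cannot catch every mix-up: when the declared
type codes happen to be [0, 1, 2], child ids and type codes are
indistinguishable.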

Regards

Antoine.


Re: Unions: storing type_ids or type_codes?

Posted by Wes McKinney <we...@gmail.com>.
hi Antoine,

The latter is correct, or at least what is intended in the specification.

For example, if the type metadata indicates type codes [0, 5, 10], then the
"types" buffer should contain values selected from these codes rather
than physical child indexes (which would be [0, 1, 2] in this case).
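
To make the distinction concrete, here is a minimal Python sketch using
plain lists rather than any Arrow library. The names `type_codes`,
`code_to_child`, and `child_indexes_for` are illustrative assumptions, not
part of the Arrow API.

```python
# Illustrative model only: plain Python, no Arrow library involved.
# Type codes declared in the union's type metadata, in child order.
type_codes = [0, 5, 10]

# Map each logical type code to the physical index of its child.
code_to_child = {code: index for index, code in enumerate(type_codes)}


def child_indexes_for(types_buffer):
    """Translate logical type codes from a "types" buffer into child indexes."""
    return [code_to_child[code] for code in types_buffer]


# Under the intended reading, the buffer holds type codes, so a reader
# has to translate them before it can locate the child for each slot.
types_buffer = [5, 0, 10, 5]
print(child_indexes_for(types_buffer))
```

A buffer that stored physical child indexes instead, such as [1, 0, 2, 1],
would fail this lookup for the values 1 and 2, since neither is a declared
type code.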

Thanks


Re: Unions: storing type_ids or type_codes?

Posted by Wes McKinney <we...@gmail.com>.
> Wes, is the intended usage of type_ids to allow a producer to pass a
> subset of a union's columns without modifying the type codes?

Yes
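
A small Python sketch of what that could look like, reading "columns" as the
union's children. The helper `select_children` and the sample data are
hypothetical, and the sketch assumes the producer only passes slots whose
type code belongs to a kept child.

```python
# Illustrative model only: plain Python, no Arrow library involved.
def select_children(type_codes, children, keep_codes):
    """Keep only the children whose type code is in keep_codes.

    The surviving children keep their original type codes, even though
    their physical positions shift.
    """
    kept = [(code, child) for code, child in zip(type_codes, children)
            if code in keep_codes]
    return [code for code, _ in kept], [child for _, child in kept]


type_codes = [0, 5, 10]
children = ["ints", "floats", "strings"]  # stand-ins for child arrays

new_codes, new_children = select_children(type_codes, children, {0, 10})
print(new_codes, new_children)
```

Here "strings" moves from physical position 2 to position 1 but is still
identified by type code 10, so values already written to a "types" buffer
stay meaningful.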

On Tue, Nov 26, 2019 at 8:08 PM Fan Liya <li...@gmail.com> wrote:
>
> Hi Antoine,
>
> For Java, the physical child id is the same as the logical type code, as
> the index of each child vector is the code (ordinal) of the vector's minor
> type.
> This leads to a problem: only a single vector of each type can exist
> in a union vector, so strictly speaking, the Java implementation is not
> consistent with the Arrow specification. (Micah pointed this out long
> ago.)
>
> Best,
> Liya Fan

Re: Unions: storing type_ids or type_codes?

Posted by Fan Liya <li...@gmail.com>.
Hi Antoine,

For Java, the physical child id is the same as the logical type code, as
the index of each child vector is the code (ordinal) of the vector's minor
type.
This leads to a problem: only a single vector of each type can exist
in a union vector, so strictly speaking, the Java implementation is not
consistent with the Arrow specification. (Micah pointed this out long
ago.)
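
The limitation can be sketched in a few lines of Python. This is a
hypothetical model of the scheme described above, not the Java
implementation: the `MinorType` members and their ordinals are made up, and
raising an error on a second child of the same type is an arbitrary choice,
since the behavior in that case is not stated here.

```python
# Hypothetical model only: not the Java implementation.
from enum import IntEnum


class MinorType(IntEnum):
    """Illustrative subset of value types; the ordinals are made up."""
    INT = 0
    FLOAT8 = 1
    VARCHAR = 2


def add_child(children, minor_type, vector):
    """Store a child in the slot given by its type's ordinal."""
    slot = int(minor_type)
    if slot in children:
        # The slot is determined by the type alone, so a second child
        # of the same type has nowhere to go.
        raise ValueError(f"a child of type {minor_type.name} already exists")
    children[slot] = vector
    return slot


children = {}
add_child(children, MinorType.INT, [1, 2, 3])
add_child(children, MinorType.VARCHAR, ["a", "b"])
```

With this layout, a union of two distinct INT children cannot be
represented, whereas separate type codes per child would allow it.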

Best,
Liya Fan


On Tue, Nov 26, 2019 at 9:59 PM Francois Saint-Jacques <
fsaintjacques@gmail.com> wrote:

> It seems that array_union_test.cc does the latter; look at how
> `expected_types` is constructed. I opened
> https://issues.apache.org/jira/browse/ARROW-7265 .
>
> Wes, is the intended usage of type_ids to allow a producer to pass a
> subset of a union's columns without modifying the type codes?
>
> François

Re: Unions: storing type_ids or type_codes?

Posted by Francois Saint-Jacques <fs...@gmail.com>.
It seems that array_union_test.cc does the latter; look at how
`expected_types` is constructed. I opened
https://issues.apache.org/jira/browse/ARROW-7265 .

Wes, is the intended usage of type_ids to allow a producer to pass a
subset of a union's columns without modifying the type codes?

François

